
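Under-sampling method based on sample density peaks for imbalanced data
Cite as: 苏俊宁, 叶东毅. 基于样本密度峰值的不平衡数据欠抽样方法[J]. 计算机应用, 2020, 40(1): 83-89.
Authors: 苏俊宁 (SU Junning), 叶东毅 (YE Dongyi)
Affiliation: College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108
Funding: National Natural Science Foundation of China (61672158); Fujian Province University-Industry Cooperation Project (2018H6010).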
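Abstract (Chinese version): Imbalanced data classification is an important problem in data mining and machine learning, and the data re-sampling method is an important factor affecting classification accuracy. To address the problem that existing under-sampling methods for imbalanced data do not keep the distribution of the sampled data consistent with that of the original data, an under-sampling method based on sample density peaks is proposed. First, the density peak clustering algorithm is applied to estimate the central and boundary regions of the clusters formed by the majority-class samples, and each sample's weight is then computed from the local density of the cluster region in which it lies and the distribution of the different density peaks. Next, the majority-class samples are under-sampled according to these weights, so that the number of samples drawn decreases gradually from the central region of a cluster toward its boundary region, which reflects the original data distribution well while also suppressing noise. Finally, the sampled majority-class samples are combined with all minority-class samples to form a balanced data set for training the classifier. Experimental results on multiple data sets show that, compared with existing under-sampling methods such as RBBag, uNBBag and KAcBag, the proposed method achieves some improvement in both F1-measure and G-mean, and is an effective and feasible sampling method.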
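
The sampling steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not give the weight formula, so the sketch assumes that a sample's weight equals the Gaussian-kernel local density used in density peak clustering, with the cutoff distance taken as a low percentile of the pairwise distances. The function names `local_density`, `density_weighted_undersample` and `build_balanced_set`, and the parameter `dc_percentile`, are hypothetical.

```python
import numpy as np

def local_density(X, dc_percentile=2.0):
    """Gaussian-kernel local density: rho_i = sum over j != i of exp(-(d_ij / d_c)^2)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    # Assumption: the cutoff d_c is a low percentile of the distinct pairwise distances.
    upper = np.triu_indices(len(X), k=1)
    dc = max(np.percentile(dist[upper], dc_percentile), np.finfo(float).eps)
    return np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0  # drop the self term exp(0)

def density_weighted_undersample(X_maj, n_keep, seed=None):
    """Draw n_keep majority-class indices without replacement, favouring dense regions."""
    rng = np.random.default_rng(seed)
    # Assumed weight: local density only (the paper also uses density peak information).
    weights = local_density(X_maj) + 1e-12  # keep every probability strictly positive
    return rng.choice(len(X_maj), size=n_keep, replace=False, p=weights / weights.sum())

def build_balanced_set(X_maj, X_min, seed=None):
    """Combine the sampled majority samples (label 0) with all minority samples (label 1)."""
    keep = density_weighted_undersample(X_maj, n_keep=len(X_min), seed=seed)
    X = np.vstack([X_maj[keep], X_min])
    y = np.concatenate([np.zeros(len(keep), dtype=int), np.ones(len(X_min), dtype=int)])
    return X, y
```

Because sparse points receive weights close to zero, isolated majority samples (likely noise) are rarely drawn, while cluster centres are drawn most often. The dense distance matrix costs O(n²) memory, so this form is only suitable for small data sets.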

Keywords: imbalanced data; density peak; sample weight; under-sampling; ensemble learning
Received: 2019-06-10
Revised: 2019-07-23

Under-sampling method based on sample density peaks for imbalanced data
SU Junning, YE Dongyi. Under-sampling method based on sample density peaks for imbalanced data[J]. Journal of Computer Applications, 2020, 40(1): 83-89.
Authors: SU Junning  YE Dongyi
Affiliation: College of Mathematics and Computer Science, Fuzhou University, Fuzhou Fujian 350108, China
Abstract: Imbalanced data classification is an important problem in data mining and machine learning, and the way the data are re-sampled is crucial to classification accuracy. Existing under-sampling methods for imbalanced data cannot keep the distribution of the sampled data in good agreement with that of the original data. To address this problem, an under-sampling method based on sample density peaks is proposed. First, the density peak clustering algorithm is applied to cluster the majority-class samples and to estimate the central and boundary regions of the resulting clusters, so that each sample's weight is determined from the local density of the cluster region in which the sample lies and the distribution of the different density peaks. Then, the majority-class samples are under-sampled according to these weights, so that the number of extracted majority-class samples decreases gradually from the central region of a cluster to its boundary region. In this way, the extracted samples reflect the original sample distribution well while suppressing noise. Finally, a balanced data set is constructed from the sampled majority-class samples and all minority-class samples for classifier training. Experimental results on multiple data sets show that the proposed method improves F1-measure and G-mean compared with existing methods such as RBBag (Roughly Balanced Bagging), uNBBag (under-sampling NeighBorhood Bagging) and KAcBag (K-means AdaCost bagging), indicating that it is an effective and feasible sampling method.
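
For reference, the two evaluation metrics named in this abstract are usually computed from the confusion matrix as in the sketch below, with the minority class treated as the positive class. These are the standard textbook definitions; the abstract does not state the exact formulas used in the paper, and the function name `f1_and_gmean` is illustrative.

```python
import math

def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1-measure, G-mean) for a binary confusion matrix.

    tp/fn count minority (positive) samples, tn/fp count majority (negative) samples.
    Ratios with an empty denominator are defined as 0.0 here (an assumption).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean
```

Both metrics penalize ignoring the minority class: a classifier that labels every sample as majority on a 990-to-10 split reaches 99% accuracy, yet `f1_and_gmean(0, 0, 10, 990)` gives 0.0 for both scores.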
Keywords: imbalanced data; density peak; sample weight; under-sampling; ensemble learning
This article is indexed in databases including 维普 (VIP) and 万方数据 (Wanfang Data).