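基于欠采样和代价敏感的不平衡数据分类算法 (Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data)
Cite this article: WANG Junhong, YAN Jiarong. 基于欠采样和代价敏感的不平衡数据分类算法[J]. 计算机应用 (Journal of Computer Applications), 2021, 41(1): 48-52.
Authors: WANG Junhong (王俊红), YAN Jiarong (闫家荣)
Affiliations: 1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan 030006
Funding: Natural Science Foundation of Shanxi Province; National Natural Science Foundation of China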
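Abstract: To address the low prediction accuracy of traditional classifiers on the minority class of unbalanced datasets, this paper proposes USCBoost, a classification algorithm for unbalanced data based on undersampling and cost-sensitive learning. First, before the base classifier is trained in each AdaBoost iteration, the majority class samples are sorted by weight in descending order, and a number of majority class samples comparable to the number of minority class samples is selected according to sample weight. The weights of the selected majority class samples are then normalized, and these samples are combined with the minority class samples into a temporary training set for training the base classifier. Second, in the weight update stage, the minority class is assigned a higher misclassification cost, so that the weights of minority class samples increase faster and the weights of majority class samples increase more slowly. USCBoost is compared with AdaBoost, AdaCost, and RUSBoost on 10 UCI datasets. USCBoost obtains the highest score on 6 datasets under the F1-measure criterion and on 9 datasets under the G-mean criterion, which indicates that the proposed algorithm classifies unbalanced data better than the compared methods.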

Keywords: unbalanced data; classification; cost-sensitive; AdaBoost algorithm; undersampling
Received: 2020-05-31
Revised: 2020-07-22
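
The two steps the abstract describes, weight-ranked undersampling of the majority class and a cost-sensitive weight update, can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract gives neither the exact selection rule nor the update formula, so the deterministic top-k selection, the normalization to a sum of 1, the exponential update, and the names `select_majority`, `update_weights`, `cost_minority`, and `cost_majority` are all assumptions.

```python
import math


def select_majority(weights, labels, minority_label=1):
    """Pick the k highest-weight majority samples, k = number of minority samples.

    Assumption: selection is a deterministic top-k by weight, and the selected
    majority weights are normalized to sum to 1.
    """
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    k = min(len(minority_idx), len(majority_idx))
    # Sort majority samples by weight, largest first (ties keep original order).
    ranked = sorted(majority_idx, key=lambda i: weights[i], reverse=True)
    chosen = ranked[:k]
    total = sum(weights[i] for i in chosen)
    normalized = {i: weights[i] / total for i in chosen}
    return minority_idx, chosen, normalized


def update_weights(weights, labels, preds, alpha,
                   minority_label=1, cost_minority=2.0, cost_majority=1.0):
    """Cost-sensitive weight update (hypothetical form, not the paper's formula).

    Misclassified samples are scaled by exp(alpha * cost), where the minority
    class carries the larger cost; correctly classified samples are scaled by
    exp(-alpha). The result is renormalized to sum to 1.
    """
    updated = []
    for w, y, p in zip(weights, labels, preds):
        if y == p:
            updated.append(w * math.exp(-alpha))
        else:
            cost = cost_minority if y == minority_label else cost_majority
            updated.append(w * math.exp(alpha * cost))
    total = sum(updated)
    return [w / total for w in updated]


# Toy data: two minority samples (label 1) and four majority samples (label -1).
labels = [1, 1, -1, -1, -1, -1]
weights = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]
minority_idx, chosen, normalized = select_majority(weights, labels)
print("temporary training set indices:", minority_idx + chosen)
```

With the assumed costs, a misclassified minority sample gains weight faster than a misclassified majority sample, which is the qualitative behavior the abstract describes.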

Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data
WANG Junhong, YAN Jiarong. Classification algorithm based on undersampling and cost-sensitiveness for unbalanced data[J]. Journal of Computer Applications, 2021, 41(1): 48-52.
Authors: WANG Junhong, YAN Jiarong
Affiliation: 1. School of Computer and Information Technology, Shanxi University, Taiyuan Shanxi 030006, China; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan Shanxi 030006, China
Abstract: To address the low prediction accuracy that traditional classifiers achieve on the minority class of unbalanced datasets, an unbalanced data classification algorithm based on undersampling and cost-sensitiveness, called USCBoost (UnderSamples and Cost-sensitive Boosting), was proposed. Firstly, before the base classifier was trained in each iteration of the AdaBoost (Adaptive Boosting) algorithm, the majority class samples were sorted by weight in descending order, and a number of majority class samples equal to the number of minority class samples was selected according to sample weight. The weights of the selected majority class samples were then normalized, and these samples were combined with the minority class samples into a temporary training set for training the base classifier. Secondly, in the weight update stage, a higher misclassification cost was assigned to the minority class, so that the weights of minority class samples increased faster and the weights of majority class samples increased more slowly. On ten UCI datasets, USCBoost was compared with AdaBoost, AdaCost (Cost-sensitive AdaBoosting), and RUSBoost (Random Under-Sampling Boosting). Experimental results show that USCBoost achieved the highest score on six datasets under the F1-measure criterion and on nine datasets under the G-mean criterion. These results indicate that the proposed algorithm classifies unbalanced data better than the compared methods.
Keywords: unbalanced data; classification; cost-sensitiveness; AdaBoost algorithm; undersampling
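
The two evaluation criteria named in the abstract, F1-measure and G-mean, can be computed from a binary confusion matrix as sketched below. The sketch assumes the common definitions (F1 as the harmonic mean of precision and recall on the minority class, G-mean as the square root of sensitivity times specificity); the abstract does not state the formulas, and the function names are illustrative.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN with the minority class treated as positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_measure(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(sensitivity * specificity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy example: 2 minority (1) and 6 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 8  # a classifier that ignores the minority class
print(f1_measure(y_true, always_majority), g_mean(y_true, always_majority))
```

Both measures drop to zero for a classifier that always predicts the majority class, even though its plain accuracy on this toy data is 75%, which is why they are preferred over accuracy for unbalanced data.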
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).