首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
The problem of modeling binary responses by using cross-sectional data has been addressed with a number of satisfying solutions that draw on both parametric and nonparametric methods. However, there exist many real situations where one of the two responses (usually the most interesting for the analysis) is rare. It has been largely reported that this class imbalance heavily compromises the process of learning, because the model tends to focus on the prevalent class and to ignore the rare events. However, not only the estimation of the classification model is affected by a skewed distribution of the classes, but also the evaluation of its accuracy is jeopardized, because the scarcity of data leads to poor estimates of the model’s accuracy. In this work, the effects of class imbalance on model training and model assessing are discussed. Moreover, a unified and systematic framework for dealing with the problem of imbalanced classification is proposed, based on a smoothed bootstrap re-sampling technique. The proposed technique is founded on a sound theoretical basis and an extensive empirical study shows that it outperforms the main other remedies to face imbalanced learning problems.  相似文献   

2.
This study investigates how to alleviate the class imbalance problems for constructing unbiased classifiers when instances in one class are more than that in another. Since keeping the data distribution unchanged and expanding class boundaries after synthetic samples have been added influence the classification performance greatly, we take into account the above two factors, and propose a Random Walk Over-Sampling approach (RWO-Sampling) to balancing different class samples by creating synthetic samples through randomly walking from the real data. When some conditions are satisfied, it can be proved that, both the expected average and the standard deviation of the generated samples equal to that of the original minority class data. RWO-Sampling also expands the minority class boundary after synthetic samples have been generated. In this work, we perform a broad experimental evaluation, and experimental results show that, RWO-Sampling statistically does much better than alternative methods on imbalanced data sets when implementing common baseline algorithms.  相似文献   

3.
More than two decades ago the imbalanced data problem turned out to be one of the most important and challenging problems. Indeed, missing information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has proved that there are certain factors increasing the problem’s complexity. These additional difficulties are closely related to the data distribution over decision classes. In spite of numerous methods which have been proposed, the flexibility of existing solutions needs further improvement. Therefore, we offer a novel rough–granular computing approach (RGA, in short) to address the mentioned issues. New synthetic examples are generated only in specific regions of feature space. This selective oversampling approach is applied to reduce the number of misclassified minority class examples. A strategy relevant for a given problem is obtained by formation of information granules and an analysis of their degrees of inclusion in the minority class. Potential inconsistencies are eliminated by applying an editing phase based on a similarity relation. The most significant algorithm parameters are tuned in an iterative process. The set of evaluated parameters includes the number of nearest neighbours, complexity threshold, distance threshold and cardinality redundancy. Each data model is built by exploiting different parameters’ values. The results obtained by the experimental study on different datasets from the UCI repository are presented. They prove that the proposed method of inducing the neighbourhoods of examples is crucial in the proper creation of synthetic positive instances. The proposed algorithm outperforms related methods in most of the tested datasets. The set of valid parameters for the Rough–Granular Approach (RGA) technique is established.  相似文献   

4.
5.
针对非平衡数据分类问题,提出了一种改进的SVM-KNN分类算法,在此基础上设计了一种集成学习模型.该模型采用限数采样方法对多数类样本进行分割,将分割后的多数类子簇与少数类样本重新组合,利用改进的SVM-KNN分别训练,得到多个基本分类器,对各个基本分类器进行组合.采用该模型对UCI数据集进行实验,结果显示该模型对于非平衡数据分类有较好的效果.  相似文献   

6.
Qiao  Shaojie  Han  Nan  Huang  Faliang  Yue  Kun  Wu  Tao  Yi  Yugen  Mao  Rui  Yuan  Chang-an 《Applied Intelligence》2022,52(7):7870-7889
Applied Intelligence - In the real-world applications of machine learning and cybernetics, the data with imbalanced distribution of classes or skewed class proportions is very pervasive. When...  相似文献   

7.
8.
Learning from imbalanced data is a challenging task in a wide range of applications, which attracts significant research efforts from machine learning and data mining community. As a natural approach to this issue, oversampling balances the training samples through replicating existing samples or synthesizing new samples. In general, synthesization outperforms replication by supplying additional information on the minority class. However, the additional information needs to follow the same normal distribution of the training set, which further constrains the new samples within the predefined range of training set. In this paper, we present the Wiener process oversampling (WPO) technique that brings the physics phenomena into sample synthesization. WPO constructs a robust decision region by expanding the attribute ranges in training set while keeping the same normal distribution. The satisfactory performance of WPO can be achieved with much lower computing complexity. In addition, by integrating WPO with ensemble learning, the WPOBoost algorithm outperformsmany prevalent imbalance learning solutions.  相似文献   

9.
《微型机与应用》2015,(23):7-10
针对传统的机器学习算法对不平衡数据集的少类分类准确率不高的问题,基于支持向量机和模糊聚类,提出一种不平衡数据加权集成学习算法。首先提出加权支持向量机模型(Weighted Support Vector Machine,WSVM),该模型根据不同类别数据所占比例的不同,为各类别分配不同的权重,然后将WSVM与模糊聚类结合提出一种新的集成学习算法。将本文提出的算法应用于人造数据集和UCI数据集实验中,实验结果表明,所提出的算法能够有效地解决不平衡数据的分类问题,具有更好的分类性能。  相似文献   

10.
One of the most widely used approaches to the class-imbalanced issue is ensemble learning. The base classifier is trained using an unbalanced training set in the conventional ensemble learning approach. We are unable to select the best suitable resampling method or base classifier for the training set, despite the fact that researchers have examined employing resampling strategies to balance the training set. A multi-armed bandit heterogeneous ensemble framework was developed as a solution to these issues. This framework employs the multi-armed bandit technique to pick the best base classifier and resampling techniques to build a heterogeneous ensemble model. To obtain training sets, we first employ the bagging technique. Then, we use the instances from the out-of-bag set as the validation set. In general, we consider the basic classifier combination with the highest validation set score to be the best model on the bagging subset and add it to the pool of model. The classification performance of the multi-armed bandit heterogeneous ensemble model is then assessed using 30 real-world imbalanced data sets that were gathered from UCI, KEEL, and HDDT. The experimental results demonstrate that, under the two assessment metrics of AUC and Kappa, the proposed heterogeneous ensemble model performs competitively with other nine state-of-the-art ensemble learning methods. At the same time, the findings of the experiment are confirmed by the statistical findings of the Friedman test and Holm's post-hoc test.  相似文献   

11.
毛文涛  田杨阳  王金婉  何玲 《控制与决策》2016,31(12):2147-2154
针对现有算法对贯序到达的密度型不均衡数据分类效果不佳的缺陷, 提出一种基于粒度划分的在线贯序极限学习机算法. 离线阶段,根据数据分布特性对多类样本进行粒度划分, 用粒心代替原有样本, 建立初始模型; 在线阶段, 根据更新后的分布特性对多类边界数据进行二次粒度划分, 替换原有边界数据, 并动态更新网络权值. 理论分析证明该算法存在信息损失上界. 实验结果表明, 该算法能有效提高贯序不均衡数据上的整体泛化性能和分类效率.  相似文献   

12.
The storage and labeling of industrial data incur significant costs during the development of defect detection algorithms. Active learning solves these problems by selecting the most informative data among the given unlabeled data. The existing active learning methods for image segmentation focus on studying natural images and medical images, with less attention given to industrial images, and little research has been performed on imbalanced data. To solve these problems, we propose an active learning framework to selecting informative data for defect segmentation under imbalanced data. In the initialization stage, the framework uses self-supervised learning to initialize the data so that the initialization data contain more defect data, thereby solving the cold-start problem. During the iterative stage, we design the main body of the active learning framework, which is composed of a segmentation learner and a reconstruction learner. These learners use supervised learning to further improve the framework’s ability to select informative data. The experimental results obtained on public and self-owned datasets show that the framework can save 70% of the required storage space and greatly reduce the cost of labeling. The intersection over union value proves that the designed framework can achieve the equivalent effect of labeling the whole dataset by labeling partial data.  相似文献   

13.
Classification is an important task in data mining. Class imbalance has been reported to hinder the performance of standard classification models. However, our study shows that class imbalance may not be the only cause to blame for poor performance. Rather, the underlying complexity of the problem may play a more fundamental role. In this paper, a decision tree method based on Kolmogorov-Smirnov statistic (K-S tree), is proposed to segment the training data so that a complex problem can be divided into several easier sub-problems where class imbalance becomes less challenging. K-S tree is also used to perform feature selection, which not only selects relevant variables but also removes redundant ones. After segmentation, a two-way re-sampling method is used at the segment level to empirically determine the optimal sampling percentage and the rebalanced data is used to fit logistic regression models, also at the segment level. The effectiveness of the proposed method is demonstrated through its application on property refinance prediction.  相似文献   

14.
Pattern Analysis and Applications - Imbalanced learning is one of the substantial challenging problems in the field of data mining. The datasets that have skewed class distribution pose hindrance...  相似文献   

15.
左鹏玉  周洁  王士同   《智能系统学报》2020,15(3):520-527
针对在线序列极限学习机对于类别不平衡数据的学习效率低、分类准确率差的问题,提出了面对类别不平衡的增量在线序列极限学习机(IOS-ELM)。该算法根据类别不平衡比例调整平衡因子,利用分块矩阵的广义逆矩阵对隐含层节点数进行寻优,提高了模型对类别不平衡数据的在线处理能力,最后通过14个二类和多类不平衡数据集对该算法有效性和可行性进行验证。实验结果表明:该算法与同类其他算法相比具有更好的泛化性和准确率,适用于类别不平衡场景下的在线学习。  相似文献   

16.
针对现有机器学习算法难以有效提高贯序不均衡数据分类问题中少类样本分类精度的问题,提出一种基于混合采样策略的在线贯序极限学习机。该算法可在提高少类样本分类精度的前提下,减少多类样本的分类精度损失,主要包括离线和在线两个阶段:离线阶段采用均衡采样策略,利用主曲线分别构建多类和少类样本的可信区域,在不改变样本分布特性的前提下,利用可信区域扩充少类样本和削减多类样本,进而得到均衡的离线样本集,建立初始模型;在线阶段仅对贯序到达的多类数据进行欠采样,根据样本重要度挑选最具价值的多类样本,进而动态更新网络权值。通过理论分析证明所提算法在理论上存在损失信息上界。采用UCI标准数据集和实际的澳门空气污染预报数据进行仿真实验,结果表明,与现有在线贯序极限学习机(OS-ELM)、极限学习机(ELM)和元认知在线贯序极限学习机(MCOS-ELM)算法相比,所提算法对少类样本的预测精度更高,且数值稳定性良好。  相似文献   

17.
18.
Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques including sampling and cost sensitive learning are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error based objective functions. As a result, there is clearly a gap between themeasure according to which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space to bridge this gap. In the case studies, a standard threelayer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.  相似文献   

19.
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data in which most examples are labeled as one class and only few belong to another class, traditional data mining approaches do not have a good ability to predict the crucial minority instances. Unfortunately, many real world data sets like health examination, inspection, credit fraud detection, spam identification and text mining all are faced with this situation. In this study, we present a novel model called the “Information Granulation Based Data Mining Approach” to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from Information Granules rather then from numerical data. This method also introduces a Latent Semantic Indexing based feature extraction tool by using Singular Value Decomposition, to dramatically reduce the data dimensions. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly increase the ability of classifying imbalanced data.  相似文献   

20.
不平衡数据分类是机器学习研究的热点问题,传统分类算法假定不同类别分布大致平衡,并不符合天气数据实际情况。提出基于修改的代价敏感学习的方法对不平衡的天气数据进行预处理,结合天气数据自身的特点,以单位时间的降雨量为成本的值,将数据合理有效地区分为下雨与非下雨两类。进而运用基于逻辑的方法对处理完的数据进行分析,运用分支限界算法得出布尔分类器。实验结果表明此方法可行有效。该方法可进一步对布尔分类器结果进行逻辑运算,从而达到更加灵活地操作分类器的效果。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号