Found 20 similar documents (search time: 15 ms)
1.
The problem of modeling binary responses from cross-sectional data has been addressed with a number of satisfactory solutions that draw on both parametric and nonparametric methods. However, in many real situations one of the two responses (usually the one of most interest for the analysis) is rare. It has been widely reported that this class imbalance heavily compromises learning, because the model tends to focus on the prevalent class and to ignore the rare events. A skewed class distribution affects not only the estimation of the classification model but also the evaluation of its accuracy, because the scarcity of data leads to poor accuracy estimates. In this work, the effects of class imbalance on model training and model assessment are discussed. Moreover, a unified and systematic framework for dealing with imbalanced classification is proposed, based on a smoothed bootstrap re-sampling technique. The proposed technique rests on a sound theoretical basis, and an extensive empirical study shows that it outperforms the other main remedies for imbalanced learning problems.
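The abstract above names a smoothed bootstrap but gives no details, so the following is only a minimal sketch of the general idea: pick a class with equal probability, pick a real observation of that class, and jitter it with Gaussian noise. The function name `smoothed_bootstrap`, the fixed bandwidth `h`, and the equal class probabilities are assumptions made for illustration, not the authors' method.

```python
import random

def smoothed_bootstrap(X, y, n_new, h=0.1, seed=0):
    """Draw n_new synthetic (row, label) pairs.

    Assumed scheme (not taken from the abstract): choose a class uniformly,
    choose one real row of that class, then add N(0, h^2) noise per feature.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    labels = sorted(by_class)
    samples = []
    for _ in range(n_new):
        label = rng.choice(labels)            # classes drawn with equal probability
        centre = rng.choice(by_class[label])  # a real observation of that class
        samples.append(([v + rng.gauss(0.0, h) for v in centre], label))
    return samples

if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
    y = [0, 0, 1]
    for row, label in smoothed_bootstrap(X, y, n_new=4):
        print(label, row)
```

Because classes are drawn uniformly, the rare class makes up roughly half of the synthetic sample regardless of how rare it is in `X`.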
2.
This study investigates how to alleviate the class imbalance problem in order to construct unbiased classifiers when one class has more instances than another. Because classification performance is strongly influenced by whether the data distribution is kept unchanged and whether class boundaries are expanded after synthetic samples are added, we take both factors into account and propose a Random Walk Over-Sampling approach (RWO-Sampling), which balances the classes by creating synthetic samples through random walks from the real data. When some conditions are satisfied, it can be proved that both the expected average and the standard deviation of the generated samples equal those of the original minority class data. RWO-Sampling also expands the minority class boundary after synthetic samples have been generated. We perform a broad experimental evaluation, and the results show that RWO-Sampling performs statistically much better than alternative methods on imbalanced data sets when used with common baseline algorithms.
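As a rough illustration of oversampling by random walks from real minority samples, the sketch below perturbs each feature of a randomly chosen minority row by a Gaussian step. The step size (the feature's standard deviation divided by the square root of the minority count) and the function name `random_walk_oversample` are assumptions made here; the abstract does not state the exact update rule or the conditions of its proof.

```python
import math
import random
import statistics

def random_walk_oversample(minority, n_new, seed=0):
    """Create n_new synthetic rows by taking one random step from real rows.

    Assumed step per feature j: x_j - (sigma_j / sqrt(n)) * r, with r ~ N(0, 1),
    where sigma_j is the population standard deviation of feature j.
    """
    rng = random.Random(seed)
    n = len(minority)
    dims = len(minority[0])
    step = [
        statistics.pstdev([row[j] for row in minority]) / math.sqrt(n)
        for j in range(dims)
    ]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)  # start the walk at a real minority sample
        synthetic.append([base[j] - step[j] * rng.gauss(0.0, 1.0) for j in range(dims)])
    return synthetic

if __name__ == "__main__":
    minority = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
    for row in random_walk_oversample(minority, n_new=3):
        print(row)
```

A feature that is constant across the minority class has zero standard deviation, so it is copied unchanged into every synthetic row.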
3.
More than two decades ago the imbalanced data problem turned out to be one of the most important and challenging problems. Indeed, missing information about the minority class leads to a significant degradation in classifier performance. Moreover, comprehensive research has shown that certain factors increase the problem's complexity. These additional difficulties are closely related to the data distribution over decision classes. In spite of the numerous methods that have been proposed, the flexibility of existing solutions needs further improvement. Therefore, we offer a novel rough–granular computing approach (RGA, in short) to address these issues. New synthetic examples are generated only in specific regions of the feature space. This selective oversampling approach is applied to reduce the number of misclassified minority class examples. A strategy relevant to a given problem is obtained by forming information granules and analysing their degrees of inclusion in the minority class. Potential inconsistencies are eliminated by applying an editing phase based on a similarity relation. The most significant algorithm parameters are tuned in an iterative process. The set of evaluated parameters includes the number of nearest neighbours, complexity threshold, distance threshold and cardinality redundancy. Each data model is built using different parameter values. The results of the experimental study on different datasets from the UCI repository are presented. They show that the proposed method of inducing the neighbourhoods of examples is crucial to the proper creation of synthetic positive instances. The proposed algorithm outperforms related methods on most of the tested datasets. The set of valid parameters for the Rough–Granular Approach (RGA) technique is established.
4.
5.
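To address the classification of imbalanced data, an improved SVM-KNN classification algorithm is proposed, and an ensemble learning model is designed on top of it. The model partitions the majority-class samples using a limited-number sampling method, recombines each resulting majority-class sub-cluster with the minority-class samples, trains an improved SVM-KNN on each combination to obtain multiple base classifiers, and then combines the base classifiers. Experiments on UCI data sets show that the model performs well on imbalanced data classification.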
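The data-preparation step described above (split the majority class, then pair each piece with the full minority class) can be sketched as follows. The random shuffling, the chunk size equal to the minority count, and the name `balanced_subsets` are assumptions made for illustration; the abstract does not specify how its limited-number sampling forms the sub-clusters, and the improved SVM-KNN base classifier is not reproduced here.

```python
import random

def balanced_subsets(majority, minority, seed=0):
    """Split the majority class into chunks and pair each with all minority rows.

    Assumption: chunks are random and about as large as the minority class,
    so each training subset is roughly balanced.
    """
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    size = max(1, len(minority))
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    # One base classifier would be trained on each (chunk, minority) pair.
    return [(chunk, list(minority)) for chunk in chunks]

if __name__ == "__main__":
    majority = list(range(10))
    minority = ["a", "b", "c"]
    for chunk, minority_rows in balanced_subsets(majority, minority):
        print(chunk, minority_rows)
```

Each pair would then be passed to a base learner, and the resulting classifiers combined, for example by majority vote.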
6.
Qiao Shaojie, Han Nan, Huang Faliang, Yue Kun, Wu Tao, Yi Yugen, Mao Rui, Yuan Chang-an 《Applied Intelligence》2022,52(7):7870-7889
Applied Intelligence - In the real-world applications of machine learning and cybernetics, the data with imbalanced distribution of classes or skewed class proportions is very pervasive. When...
7.
8.
Qian Li, Gang Li, Wenjia Niu, Yanan Cao, Liang Chang, Jianlong Tan, Li Guo 《Frontiers of Computer Science》2017,11(5):836-851
Learning from imbalanced data is a challenging task in a wide range of applications and attracts significant research effort from the machine learning and data mining communities. As a natural approach to this issue, oversampling balances the training samples by replicating existing samples or synthesizing new ones. In general, synthesis outperforms replication by supplying additional information on the minority class. However, the additional information needs to follow the same normal distribution as the training set, which further constrains the new samples to the predefined range of the training set. In this paper, we present the Wiener process oversampling (WPO) technique, which brings a physical phenomenon into sample synthesis. WPO constructs a robust decision region by expanding the attribute ranges of the training set while keeping the same normal distribution. WPO achieves satisfactory performance with much lower computational complexity. In addition, by integrating WPO with ensemble learning, the WPOBoost algorithm outperforms many prevalent imbalance learning solutions.
9.
10.
One of the most widely used approaches to the class imbalance issue is ensemble learning. In the conventional ensemble learning approach, the base classifier is trained on an unbalanced training set. Although researchers have examined resampling strategies to balance the training set, it remains unclear how to select the most suitable resampling method or base classifier for a given training set. A multi-armed bandit heterogeneous ensemble framework was developed as a solution to these issues. This framework employs the multi-armed bandit technique to pick the best base classifier and resampling technique when building a heterogeneous ensemble model. To obtain training sets, we first employ the bagging technique. Then, we use the instances from the out-of-bag set as the validation set. In general, we consider the base classifier combination with the highest validation set score to be the best model on the bagging subset and add it to the pool of models. The classification performance of the multi-armed bandit heterogeneous ensemble model is then assessed using 30 real-world imbalanced data sets gathered from UCI, KEEL, and HDDT. The experimental results demonstrate that, under the two assessment metrics of AUC and Kappa, the proposed heterogeneous ensemble model performs competitively with nine other state-of-the-art ensemble learning methods. The experimental findings are also confirmed by the Friedman test and Holm's post-hoc test.
11.
12.
The storage and labeling of industrial data incur significant costs during the development of defect detection algorithms. Active learning addresses these problems by selecting the most informative data among the given unlabeled data. Existing active learning methods for image segmentation focus on natural images and medical images, with less attention given to industrial images, and little research has been performed on imbalanced data. To solve these problems, we propose an active learning framework for selecting informative data for defect segmentation under imbalanced data. In the initialization stage, the framework uses self-supervised learning to initialize the data so that the initialization data contain more defect data, thereby solving the cold-start problem. During the iterative stage, we design the main body of the active learning framework, which is composed of a segmentation learner and a reconstruction learner. These learners use supervised learning to further improve the framework's ability to select informative data. The experimental results obtained on public and self-owned datasets show that the framework can save 70% of the required storage space and greatly reduce the cost of labeling. The intersection over union value shows that the designed framework can achieve an effect equivalent to labeling the whole dataset while labeling only part of the data.
13.
Rongsheng Gong, Samuel H. Huang 《Expert systems with applications》2012,39(6):6192-6200
Classification is an important task in data mining. Class imbalance has been reported to hinder the performance of standard classification models. However, our study shows that class imbalance may not be the only cause of poor performance. Rather, the underlying complexity of the problem may play a more fundamental role. In this paper, a decision tree method based on the Kolmogorov-Smirnov statistic (K-S tree) is proposed to segment the training data so that a complex problem can be divided into several easier sub-problems in which class imbalance becomes less challenging. The K-S tree is also used to perform feature selection, which not only selects relevant variables but also removes redundant ones. After segmentation, a two-way re-sampling method is used at the segment level to empirically determine the optimal sampling percentage, and the rebalanced data are used to fit logistic regression models, also at the segment level. The effectiveness of the proposed method is demonstrated through its application to property refinance prediction.
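For readers unfamiliar with the Kolmogorov-Smirnov statistic mentioned above, the sketch below computes the standard two-sample version: the largest gap between the empirical distribution functions of one feature in the two classes. How the paper turns this statistic into tree splits is not described in the abstract, so the code covers only the statistic itself; the function name `ks_statistic` is chosen here for illustration.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(v) - F_b(v)| over all v."""
    a, b = sorted(a), sorted(b)
    best = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, v) / len(a)  # share of a that is <= v
        cdf_b = bisect_right(b, v) / len(b)  # share of b that is <= v
        best = max(best, abs(cdf_a - cdf_b))
    return best

if __name__ == "__main__":
    majority_feature = [1, 2, 3, 4]
    minority_feature = [3, 4, 5, 6]
    print(ks_statistic(majority_feature, minority_feature))
```

A value near 1 means the feature separates the two classes almost completely, while a value near 0 means the two class-conditional distributions are nearly identical.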
14.
Pattern Analysis and Applications - Imbalanced learning is one of the substantial challenging problems in the field of data mining. The datasets that have skewed class distribution pose hindrance...
15.
16.
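Existing machine learning algorithms struggle to improve the classification accuracy of minority-class samples in sequential imbalanced data classification. To address this, an online sequential extreme learning machine based on a hybrid sampling strategy is proposed. The algorithm improves minority-class accuracy while reducing the loss of majority-class accuracy, and consists of an offline stage and an online stage. The offline stage uses a balanced sampling strategy: principal curves are used to build confidence regions for the majority and minority classes separately, and, without changing the distribution characteristics of the samples, these regions are used to expand the minority class and trim the majority class, yielding a balanced offline sample set from which the initial model is built. The online stage undersamples only the sequentially arriving majority-class data, selecting the most valuable majority-class samples according to sample importance and dynamically updating the network weights. Theoretical analysis proves that the proposed algorithm has an upper bound on information loss. Simulation experiments on UCI benchmark data sets and real Macau air pollution forecasting data show that, compared with the existing online sequential extreme learning machine (OS-ELM), extreme learning machine (ELM), and meta-cognitive online sequential extreme learning machine (MCOS-ELM) algorithms, the proposed algorithm achieves higher prediction accuracy on minority-class samples and has good numerical stability.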
17.
18.
Since the overall prediction error of a classifier on imbalanced problems can be misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is clearly a gap between the measure by which the classifier is evaluated and the way the classifier is trained. This paper investigates the prospect of bridging this gap by explicitly using the appropriate measure itself to search the hypothesis space. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
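The two measures named above can be computed from a binary confusion matrix as in the sketch below, which uses the usual definitions: G-mean as the geometric mean of sensitivity and specificity, and F-measure as the harmonic mean of precision and recall (the F1 variant). The function name and the convention of returning 0.0 when a ratio is undefined are choices made here, not details from the abstract.

```python
import math

def g_mean_and_f1(y_true, y_pred, positive=1):
    """Return (G-mean, F1) for binary labels, treating `positive` as the rare class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return g_mean, f1

if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 1]
    print(g_mean_and_f1(y_true, y_pred))
```

A classifier that always predicts the majority class scores zero on both measures even though its plain accuracy can be high, which is why these measures are preferred on imbalanced problems.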
19.
An information granulation based data mining approach for classifying imbalanced data (cited by: 2; self-citations: 0; citations by others: 2)
Recently, the class imbalance problem has attracted much attention from researchers in the field of data mining. When learning from imbalanced data in which most examples are labeled as one class and only a few belong to another, traditional data mining approaches predict the crucial minority instances poorly. Unfortunately, many real-world data sets, such as those for health examination, inspection, credit fraud detection, spam identification and text mining, face this situation. In this study, we present a novel model called the "Information Granulation Based Data Mining Approach" to tackle this problem. The proposed methodology, which imitates the human ability to process information, acquires knowledge from information granules rather than from numerical data. This method also introduces a Latent Semantic Indexing based feature extraction tool that uses Singular Value Decomposition to dramatically reduce the data dimensions. In addition, several data sets from the UCI Machine Learning Repository are employed to demonstrate the effectiveness of our method. Experimental results show that our method can significantly improve the classification of imbalanced data.