Similar literature
 Found 20 similar documents; search took 93 ms
1.
方丁 (Fang Ding), 王刚 (Wang Gang). 《计算机系统应用》 (Computer Systems & Applications), 2012, 21(7): 177-181, 248
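With the rapid development of Web 2.0, more and more users are willing to share their opinions and experiences on the Internet. This kind of review information is growing rapidly, and manual methods alone cannot cope with collecting and processing the massive amount of online information. Computer-based text sentiment classification has therefore emerged, and one of its main research focuses is improving classification accuracy. Because ensemble learning is an effective way to improve classification accuracy and has outperformed single classifiers in many fields, a text sentiment classification method based on ensemble learning is proposed. Experimental results show that three commonly used ensemble learning methods, Bagging, Boosting and Random Subspace, all improve the classification accuracy of the base classifiers, and that Random Subspace is statistically better than Bagging and Boosting across different base classifiers. These results further verify the effectiveness of ensemble learning for text sentiment classification.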

2.
This paper introduces a new ensemble approach, Feature-Subspace Aggregating (Feating), which builds local models instead of global models. Feating is a generic ensemble approach that can enhance the predictive performance of both stable and unstable learners. In contrast, most existing ensemble approaches can improve the predictive performance of unstable learners only. Our analysis shows that the new approach reduces the execution time needed to generate a model in an ensemble through an increased level of localisation in Feating. Our empirical evaluation shows that Feating performs significantly better than Boosting, Random Subspace and Bagging in terms of predictive accuracy when a stable learner, SVM, is used as the base learner. The speed-up achieved by Feating makes feasible SVM ensembles that would otherwise be infeasible for large data sets. When SVM is the preferred base learner, we show that Feating SVM performs better than Boosting decision trees and Random Forests. We further demonstrate that Feating also substantially reduces the error of another stable learner, k-nearest neighbour, and an unstable learner, decision tree.

3.
Learning from imperfect (noisy) information sources is a challenging and realistic issue for many data mining applications. Common practices include enhancing data quality by applying data preprocessing techniques, or employing robust learning algorithms to avoid developing overly complicated structures that overfit the noise. The essential goal is to reduce the impact of noise and eventually enhance the learners built from noise-corrupted data. In this paper, we propose a novel corrective classification (C2) design, which incorporates data cleansing, error correction, Bootstrap sampling and classifier ensembling for effective learning from noisy data sources. C2 differs from existing classifier ensembling or robust learning algorithms in two aspects. On one hand, the set of diverse base learners that constitute the C2 ensemble is constructed via a Bootstrap sampling process; on the other hand, C2 further improves each base learner by unifying error detection, correction and data cleansing to reduce the impact of noise. Being corrective, the classifier ensemble is built from data preprocessed/corrected by the data cleansing and correcting modules. Experimental comparisons demonstrate that C2 is not only more accurate than the learner built from the original noisy sources, but also more reliable than Bagging [4] or aggressive classifier ensemble (ACE) [56], which are two degenerate components/variants of C2. The comparisons also indicate that C2 is more stable than Boosting and DECORATE, which are two state-of-the-art ensembling methods. For real-world imperfect information sources (i.e. noisy training and/or test data), C2 is able to deliver more accurate and reliable prediction models than its peers.

4.
With the rapid growth of and increased competition in the credit industry, corporate credit risk prediction is becoming more important for credit-granting institutions. In this paper, we propose an integrated ensemble approach, called RS-Boosting, which is based on two popular ensemble strategies, i.e., boosting and random subspace, for corporate credit risk prediction. As there are two different factors encouraging diversity in RS-Boosting, this should help it achieve better performance. Two corporate credit datasets are selected to demonstrate the effectiveness and feasibility of the proposed method. Experimental results reveal that RS-Boosting achieves the best performance among the seven methods compared, which also include logistic regression analysis (LRA), decision tree (DT), artificial neural network (ANN), bagging, boosting and random subspace. All these results illustrate that RS-Boosting can be used as an alternative method for corporate credit risk prediction.

5.
Image annotation is posed as a multi-class classification problem. Pursuing higher accuracy is a long-standing but still relevant challenge in the field of image annotation. To further improve the accuracy of image annotation, we propose a multi-view multi-label (MVML) learning algorithm, in which we take multiple features (i.e., views) and ensemble learning into account simultaneously. By doing so, we make full use of the complementarity among the views and among the base learners of ensemble learning, leading to higher image annotation accuracy. With respect to the different distributions of positive and negative training examples, we propose two versions of MVML: a Boosting version and a Bagging version. The former is suitable for learning over balanced examples, while the latter applies to the opposite scenario. In addition, the weights of the base learners are evaluated on validation data instead of training data, which improves the generalization ability of the final ensemble classifiers. The experimental results show that MVML is superior to a single-view ensemble SVM.

6.
Enterprise credit risk assessment has long been regarded as a critical topic, and many statistical and intelligent methods have been explored for it. However, there are no consistent conclusions on which methods are better. Recent research suggests that combining multiple classifiers, i.e., ensemble learning, may perform better. In this paper, we propose a new hybrid ensemble approach, called RSB-SVM, which is based on two popular ensemble strategies, i.e., bagging and random subspace, and uses a Support Vector Machine (SVM) as the base learner. As there are two different factors encouraging diversity in RSB-SVM, i.e., bootstrap selection of instances and random selection of features, this should help it achieve better performance. An enterprise credit risk dataset, which includes the financial records of 239 companies and was collected by the Industrial and Commercial Bank of China, is selected to demonstrate the effectiveness and feasibility of the proposed method. Experimental results reveal that RSB-SVM can be used as an alternative method for enterprise credit risk assessment.
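
The two diversity sources named in this abstract, bootstrap selection of instances and random selection of features, can be sketched as follows. This is a minimal illustration only, not the authors' RSB-SVM implementation: the function names `draw_rsb_sample` and `project`, the feature-subset size and the toy data are all assumptions, and the step that trains an SVM on each view is omitted.

```python
import random


def draw_rsb_sample(n_instances, n_features, n_selected_features, rng):
    """Return (row_indices, feature_indices) for one base learner (hypothetical helper)."""
    # Bagging part: draw instances with replacement, same size as the training set.
    rows = [rng.randrange(n_instances) for _ in range(n_instances)]
    # Random subspace part: draw a feature subset without replacement.
    features = sorted(rng.sample(range(n_features), n_selected_features))
    return rows, features


def project(data, rows, features):
    """Build one base learner's training view from a list-of-lists data set."""
    return [[data[r][f] for f in features] for r in rows]


# Toy data: 6 instances with 4 features each (made up for illustration).
data = [[i, 10 * i, 100 * i, 1000 * i] for i in range(6)]
rng = random.Random(0)
rows, features = draw_rsb_sample(len(data), 4, 2, rng)
view = project(data, rows, features)
```

Each base learner would receive its own `view`, so two learners can differ both in which instances they see and in which features they see.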

7.
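Fault diagnosis based on support vector machine ensembles   Total citations: 3, self-citations: 2, citations by others: 3
To improve the accuracy of fault diagnosis, a support vector machine ensemble learning method based on a genetic algorithm is proposed. The corresponding genetic operators are defined, and strategies for constructing the classifiers within the ensemble are discussed. Simulation experiments on diagnosing rotor imbalance faults in a steam turbine show that ensemble learning generally outperforms a single support vector machine, and that the proposed method outperforms traditional ensemble learning methods such as Bagging and Boosting while producing ensembles that contain fewer classifiers; combining several classifier construction strategies also increases classifier diversity. The method can easily be extended to other learning algorithms such as neural networks and decision trees.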

8.
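Diversity is a necessary condition for a classifier ensemble to generalize well. However, there is currently no unified approach to analysing and handling diversity measures, their effectiveness, and optimized classifier ensembling. To address this, the paper first gives a comprehensive summary and analysis of diversity-based classifier ensembles from three perspectives: diversity measures, analysis of the effectiveness of diversity measures, and the corresponding techniques for optimized classifier ensembling. It also uses a vector space model to illustrate the effectiveness of diversity measures. Second, it compares and analyses the performance of several typical diversity-based classifier ensemble techniques (Bagging, boosting, GA-based, quadratic programming (QP), semi-definite programming (SDP), and regularized selective ensemble (RSE)) on the UCI and USPS databases, and gives practical suggestions on how to choose a diversity measure and a specific optimized ensemble technique.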

9.
Classification with imbalanced data sets has become one of the most challenging problems in data mining. Having one class much more represented than the other produces undesirable effects in both the learning and classification processes, mainly regarding the minority class. Such a problem needs accurate tools to address it; lately, ensembles of classifiers have emerged as a possible solution. Among ensemble proposals, the combination of Bagging and Boosting with preprocessing techniques has proved its ability to enhance the classification of the minority class. In this paper, we develop a new ensemble construction algorithm (EUSBoost) based on RUSBoost, one of the simplest and most accurate ensembles, which combines random undersampling with the Boosting algorithm. Our methodology aims to improve on existing proposals by enhancing the performance of the base classifiers through the use of evolutionary undersampling. In addition, we promote diversity by favouring the use of different subsets of majority class instances to train each base classifier. Focusing on two-class highly imbalanced problems, we show, supported by proper statistical analysis, that EUSBoost is able to outperform state-of-the-art ensemble-based methods. We also analyze its advantages using kappa-error diagrams, which we adapt to the imbalanced scenario.
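
The idea of training each base classifier on a different subset of majority class instances can be sketched with plain random undersampling, the step this abstract attributes to RUSBoost. The sketch below does not implement the evolutionary undersampling of EUSBoost or any boosting step; the function name `random_undersample`, the 1:1 class ratio and the toy labels are assumptions made for illustration.

```python
import random


def random_undersample(labels, minority_label, rng):
    """Return sorted indices of all minority instances plus an equally sized
    random subset of the majority instances (1:1 ratio is an assumption)."""
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Sample majority instances without replacement.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return sorted(minority + kept)


# Toy labels: class 1 is the minority (2 of 8 instances).
labels = [1, 0, 0, 0, 0, 1, 0, 0]
rng = random.Random(0)
# Each call draws a different majority subset, one per base classifier.
subsets = [random_undersample(labels, 1, rng) for _ in range(3)]
```

Every subset keeps both minority instances and two majority instances, so each base classifier sees a balanced training set.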

10.

Supply chain finance (SCF) has become more important for small- and medium-sized enterprises (SMEs) due to the global credit crunch, supply chain financing woes and tightening credit criteria for corporate lending. Predicting SME credit risk is therefore significant for guaranteeing the smooth operation of SCF. In this paper, we apply six methods, i.e., one individual machine learning (IML, i.e., decision tree) method, three ensemble machine learning methods [EML, i.e., bagging, boosting, and random subspace (RS)], and two integrated ensemble machine learning methods (IEML, i.e., RS–boosting and multi-boosting), to predict SME credit risk in SCF and compare the effectiveness and feasibility of the six methods. In the experiment, we use as empirical samples the quarterly financial and non-financial data of 48 listed SMEs from the Small and Medium Enterprise Board of the Shenzhen Stock Exchange, six listed core enterprises (CEs) from the Shanghai Stock Exchange and three listed CEs from the Shenzhen Stock Exchange during the period 2012–2013. Experimental results reveal that the IEML methods perform better than the IML and EML methods. In particular, RS–boosting is the best of the six methods at predicting SME credit risk.


11.
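An ensemble learning method for imbalanced data based on sample weight updating   Total citations: 1, self-citations: 0, citations by others: 1
The problem of imbalanced data is widespread across the application areas of big data and machine learning, such as medical diagnosis and anomaly detection. Researchers have proposed or adopted many methods for learning from imbalanced data, such as data sampling (e.g., SMOTE) and ensemble learning (e.g., EasyEnsemble). Among sampling methods, oversampling may suffer from overfitting or low classification accuracy on boundary samples, while undersampling may lead to underfitting. This paper combines the basic ideas of SMOTE, Bagging, Boosting and other algorithms to propose the Rotation SMOTE algorithm. During the Boosting process, the algorithm applies SMOTE to minority-class samples according to the base classifier's predictions, which indirectly increases the weight of the minority-class samples. Drawing on the basic idea of Focal Loss, the paper also proposes the FocalBoost algorithm, which directly optimizes the AdaBoost weight update strategy according to the base classifier's predictions. Experiments on several evaluation metrics over 11 imbalanced data sets from different application areas show that, compared with other imbalanced-data algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE achieves the highest recall on all data sets and the best or second-best G-mean and F1 score on most of them; compared with the original AdaBoost, FocalBoost achieves better performance metrics on 9 of the imbalanced data sets.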
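
The three metrics reported in this abstract (recall, G-mean and F1 score) can be computed from the confusion counts of a binary classifier. The sketch below assumes the usual definitions, with G-mean taken as the geometric mean of recall and specificity; the abstract itself does not spell the formulas out, and the function name `imbalance_metrics` and the toy labels are hypothetical.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Recall, G-mean and F1 for the class `positive` (assumed to be the minority)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    # Guard against empty denominators by returning 0.0 (a convention chosen here).
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "g_mean": g_mean, "f1": f1}


# Toy example: 2 minority and 4 majority instances.
scores = imbalance_metrics([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

Unlike plain accuracy, all three values drop sharply when the minority class is ignored, which is why they are preferred for imbalanced data.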

12.
We present an extensive empirical comparison between nineteen prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering and probability metrics over nineteen UCI benchmark data sets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, on the bias–variance decomposition and the effect of calibrating the models via Isotonic Regression on each performance metric. The selected data sets were already used in various empirical studies and cover different application domains. The source code and the detailed results of our study are publicly available.

13.
Several methods (e.g., Bagging, Boosting) of constructing and combining an ensemble of classifiers have recently been shown to be capable of improving the accuracy of a class of commonly used classifiers (e.g., decision trees, neural networks). The accuracy gain achieved, however, comes at the expense of a higher requirement for storage and computation. This storage and computation overhead can decrease the utility of these methods when applied to real-world situations. In this Letter, we propose a learning approach which allows a single neural network to approximate a given ensemble of classifiers. Experiments on a large number of real-world data sets show that this approach can substantially save storage and computation while still maintaining accuracy similar to that of the entire ensemble.

14.
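The diversity of individual learners is a key factor in ensemble learning. Popular ensemble learning algorithms such as Bagging generate diversity among individual learners through resampling. Selective ensembling chooses a subset of the individual learners produced by an ensemble learning algorithm and combines only those, and results show that this can outperform the original ensemble. How to select the learners, however, is a difficult problem. Using the Q statistic to measure the diversity between two learners, a new selective ensemble learning method for decision trees is proposed. Compared with C4.5 and Bagging, it performs well.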
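
The Q statistic mentioned in this abstract can be sketched as below, assuming the standard pairwise definition Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10), where N11 counts the samples both learners classify correctly, N00 the samples both misclassify, and N10 and N01 the samples only one of them gets right. The abstract does not give the formula or the selection procedure, so the function name `q_statistic` and the zero-denominator convention are assumptions.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q statistic from two equally long sequences of booleans,
    each marking whether a learner classified a sample correctly."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both learners correct
        elif not a and not b:
            n00 += 1          # both learners wrong
        elif a:
            n10 += 1          # only the first learner correct
        else:
            n01 += 1          # only the second learner correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0            # convention chosen here for the undefined case
    return (n11 * n00 - n01 * n10) / denom


same = q_statistic([True, True, False, False], [True, True, False, False])
opposite = q_statistic([True, False], [False, True])
mixed = q_statistic([True, True, False, False], [True, False, True, False])
```

Q lies in [-1, 1]: values near 1 mean the two learners tend to make the same mistakes, while lower values indicate more diverse learners, which is what a selective ensemble would look for.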

15.
The prediction of bankruptcy for financial companies, especially banks, has been an extensively researched area, and creditors, auditors, stockholders and senior managers are all interested in bank bankruptcy prediction. In this paper, three common machine learning models, namely Logistic, J48 and Voted Perceptron, are used as the base learners. In addition, an attribute-based ensemble learning method, Random Subspaces, and two instance-based ensemble learning methods, Bagging and Multi-Boosting, are employed to enhance the prediction accuracy of conventional machine learning models for bank failure prediction. The models are grouped into the following families of approaches: (i) conventional machine learning models, (ii) ensemble learning models and (iii) hybrid ensemble learning models. Experimental results indicate that hybrid ensemble machine learning models clearly outperform conventional base and ensemble models. These results indicate that hybrid ensemble learning models can be used as a reliable predictive model for bank failures.

16.
Credit scoring is an effective tool that helps banks make profitable decisions on granting loans. Ensemble methods, which can be divided by structure into parallel and sequential ensembles, have recently been developed in the credit scoring domain and have proven superior at discriminating between borrowers accurately. However, among the ensemble models, little consideration has been given to the following: (1) tuning the hyper-parameters of the base learner, even though this is critical to well-performing ensemble models; (2) building sequential models (i.e., boosting), as most work has focused on developing the same or different algorithms in parallel; and (3) the comprehensibility of models. This paper proposes a sequential ensemble credit scoring model based on a variant of the gradient boosting machine, namely extreme gradient boosting (XGBoost). The model mainly comprises three steps. First, data pre-processing is employed to scale the data and handle missing values. Second, a model-based feature selection system based on relative feature importance scores is used to remove redundant variables. Third, the hyper-parameters of XGBoost are adaptively tuned with Bayesian hyper-parameter optimization and used to train the model on the selected feature subset. Several hyper-parameter optimization methods and baseline classifiers are considered as reference points in the experiment. Results demonstrate that Bayesian hyper-parameter optimization performs better than random search, grid search, and manual search. Moreover, the proposed model outperforms baseline models on average over four evaluation measures: accuracy, error rate, the area under the curve (AUC) H measure (AUC-H measure), and Brier score. The proposed model also provides feature importance scores and a decision chart, which enhance the interpretability of the credit scoring model.

17.
This research aims to evaluate the potential of ensemble learning (bagging, boosting, and modified bagging) for predicting microbially induced concrete corrosion in sewer systems from the data mining (DM) perspective. Particular focus is placed on ensemble techniques for network-based DM methods, including the multi-layer perceptron neural network (MLPNN) and radial basis function neural network (RBFNN), as well as tree-based DM methods, such as the chi-square automatic interaction detector (CHAID), classification and regression tree (CART), and random forests (RF). Hence, an interdisciplinary approach is presented that combines findings from material sciences and hydrochemistry with data mining analyses to predict concrete corrosion. Factors affecting concrete corrosion, such as time, gas temperature, gas-phase H2S concentration, relative humidity, pH, and exposure phase, are used as the models' inputs. All 433 datasets are randomly selected to construct an individual model and twenty component models of boosting, bagging, and modified bagging, based on training, validating, and testing, for each DM base learner. The best ensemble predictive models are selected using several model performance indices (e.g., root mean square error, RMSE; mean absolute percentage error, MAPE; correlation coefficient, r). The results indicate that the prediction ability of the random forests DM model is superior to that of the other ensemble learners, followed by the ensemble Bag-CHAID method. On average, the ensemble tree-based models performed better than the ensemble network-based models; nevertheless, it was also found that ensemble learning enhances the general performance of individual DM models by more than 10%.
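
The three performance indices named in this abstract (RMSE, MAPE and the correlation coefficient r) can be sketched with their usual textbook definitions, taking r to be the Pearson correlation. The function names and the toy values are illustrative assumptions and are not taken from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; y_true must not contain zeros."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy values: predictions are exactly twice the observations.
observed = [1.0, 2.0, 4.0]
predicted = [2.0, 4.0, 8.0]
```

In this toy case r is 1 even though RMSE and MAPE are large, which shows why several indices are reported together rather than the correlation alone.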

18.
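In recent years, research on software defect prediction has attracted a great deal of attention. Class imbalance is common in software defect prediction: defective samples are far fewer than non-defective ones, yet defective samples are the focus of prediction. This makes it hard for defect prediction models to meet users' requirements, so imbalanced data must be handled effectively. Sampling and ensemble learning have become the two main classes of methods for handling imbalanced data, and many researchers have proposed different oversampling and ensemble learning methods. This paper studies how to combine the two classes of methods more effectively to deal with class imbalance in defect prediction. Four common oversampling methods (RandomOverSampler, SMOTE, Borderline-SMOTE and ADASYN) and four common ensemble learning methods (Bagging, Random Forest, AdaBoost and GBDT) are selected, and each oversampling method is paired with each ensemble method to form different combinations. The defect prediction performance of each combination is compared to find the best one, providing a useful reference for handling imbalance in defect prediction. Experiments show that the oversampling method ADASYN has an advantage in handling imbalance, and that its combination with the ensemble method GBDT performs best, with better defect prediction performance than the other combinations.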

19.
Bagging, boosting, rotation forest and random subspace methods are well-known re-sampling ensemble methods that generate and combine a diversity of learners using the same learning algorithm for the base classifiers. Boosting and rotation forest algorithms are considered stronger than bagging and random subspace methods on noise-free data. However, there are strong empirical indications that bagging and random subspace methods are much more robust than boosting and rotation forest in noisy settings. For this reason, in this work we built an ensemble of bagging, boosting, rotation forest and random subspace ensembles, each with 6 sub-classifiers, and used a voting methodology for the final prediction. We compared it with simple bagging, boosting, rotation forest and random subspace ensembles with 25 sub-classifiers, as well as other well-known combining methods, on standard benchmark datasets, and the proposed technique had better accuracy in most cases.
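
The final voting step described in this abstract can be sketched as a simple plurality vote over the predictions of the member ensembles. The abstract does not say which voting scheme is used, so plurality voting, the tie-breaking rule, the function name `majority_vote` and the toy predictions are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (one list per model, same length)
    into one label per sample by plurality vote."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Tie-break by taking the smallest label among the most frequent ones
        # (an arbitrary convention chosen for this sketch).
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Toy predictions from four member ensembles on three samples.
member_predictions = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
final = majority_vote(member_predictions)
```

With four voters, ties are possible (the second sample above splits 2 to 2), so any real implementation needs an explicit tie-breaking rule such as the one assumed here.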

20.
With the recent financial crisis and European debt crisis, corporate bankruptcy prediction has become an increasingly important issue for financial institutions. Many statistical and intelligent methods have been proposed; however, no single method has proved best overall for predicting corporate bankruptcy. Recent studies suggest that ensemble learning methods may be applicable to corporate bankruptcy prediction. In this paper, a new and improved Boosting method, FS-Boosting, is proposed to predict corporate bankruptcy. By injecting a feature selection strategy into Boosting, FS-Boosting can achieve better performance, as its base learners can gain both accuracy and diversity. For testing and illustration purposes, two real-world bankruptcy datasets were selected to demonstrate the effectiveness and feasibility of FS-Boosting. Experimental results reveal that FS-Boosting could be used as an alternative method for corporate bankruptcy prediction.
