Similar Literature
20 similar documents found.
1.
Rotation Forest, an effective ensemble classifier generation technique, works by using principal component analysis (PCA) to rotate the original feature axes so that different training sets for learning base classifiers can be formed. This paper presents a variant of Rotation Forest that can be viewed as a combination of Bagging and Rotation Forest: Bagging is used to inject more randomness into Rotation Forest in order to increase the diversity among the ensemble members. Experiments conducted on 33 benchmark classification data sets from the UCI repository, in which a classification tree is adopted as the base learning algorithm, demonstrate that the proposed method generally produces ensemble classifiers with lower error than Bagging, AdaBoost and Rotation Forest. A bias–variance analysis of the error shows that the proposed method improves the prediction error of a single classifier by reducing the variance term much more than the other ensemble procedures considered. Furthermore, results computed on data sets with artificial classification noise indicate that the new method is more robust to noise, and kappa-error diagrams are employed to investigate the diversity–accuracy patterns of the ensemble classifiers.
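To make the rotation-plus-bagging step concrete, the following sketch (an illustration under scikit-learn/NumPy assumptions, not the authors' implementation; the function names are invented for this example, and integer class labels are assumed) builds one bagged "rotation tree": the feature set is split into random subsets, PCA is fitted on a bootstrap sample for each subset, the per-subset loadings are assembled into a block rotation matrix, and a decision tree is trained on rotated, bootstrapped data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def fit_rotation_tree(X, y, n_subsets=3, rng=None):
    """Fit one base tree of a (bagged) Rotation-Forest-style ensemble.

    Sketch only: feature subsets are disjoint and PCA keeps all components,
    so the block matrix R is a rotation of the original feature space.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    subsets = np.array_split(rng.permutation(d), n_subsets)

    R = np.zeros((d, d))
    for cols in subsets:
        # PCA is fitted on a bootstrap sample (the extra randomness injected
        # by bagging) but its loadings are applied to the whole feature block.
        boot = rng.integers(0, n, size=n)
        pca = PCA(n_components=len(cols)).fit(X[np.ix_(boot, cols)])
        R[np.ix_(cols, cols)] = pca.components_.T

    # The tree itself is also trained on a bootstrap replicate of the rotated data.
    boot = rng.integers(0, n, size=n)
    tree = DecisionTreeClassifier(random_state=0).fit(X[boot] @ R, y[boot])
    return tree, R

def predict_ensemble(members, X):
    votes = np.array([t.predict(X @ R) for t, R in members])
    # Simple majority vote over the base trees (assumes integer labels).
    return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(),
                               axis=0, arr=votes)
```

An ensemble is then just a list of such members, e.g. `members = [fit_rotation_tree(X, y, rng=s) for s in range(50)]`.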

2.
To improve the classification accuracy of decision-tree ensembles, this paper introduces Rotation Forest, a classifier ensemble algorithm based on feature transformation. The attribute set is randomly partitioned into subsets, and principal component analysis is applied to a subsample of the data on each attribute subset to construct new training data, thereby increasing the diversity among base classifiers and improving predictive accuracy. On the Weka platform, comparative experiments were carried out in which Bagging, AdaBoost and Rotation Forest were used to build ensembles of pruned and unpruned J48 decision trees, with the average accuracy over 10 runs of 10-fold cross-validation as the basis for comparison. The results show that Rotation Forest achieves higher predictive accuracy than the other two algorithms, confirming that Rotation Forest is an effective ensemble algorithm for decision-tree classifiers.

3.
The Rotation Forest classifier is a successful ensemble method for a wide variety of data mining applications. However, the way in which Rotation Forest transforms the feature space through PCA, although powerful, penalizes training and prediction times, making it unfeasible for Big Data. In this paper, a MapReduce Rotation Forest and its implementation under the Spark framework are presented. The proposed MapReduce Rotation Forest behaves in the same way as the standard Rotation Forest, training the base classifiers on a rotated space, but uses a functional implementation of the rotation that enables its execution in Big Data frameworks. Experimental results are obtained using different cloud-based cluster configurations. Bayesian tests are used to validate the method against two ensembles for Big Data: the Random Forest and PCARDE classifiers. Our proposal parallelizes both the PCA calculation and the tree training, providing a scalable solution that retains the performance of the original Rotation Forest and achieves a competitive execution time (on average, training is more than 3 times faster than with other PCA-based alternatives). In addition, extensive experimentation shows that by tuning some parameters of the classifier (i.e., bootstrap sample size, number of trees, and number of rotations), the execution time is reduced with no significant loss of performance using a small ensemble.
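As a rough illustration of how the per-tree work can be distributed, the sketch below parallelizes one tree per task over a Spark cluster. It is only a sketch under PySpark/scikit-learn assumptions (workers must have both installed), it is not the authors' Spark code, and the per-task training function is simplified to a single PCA over all features rather than the per-subset rotation.

```python
import numpy as np
from pyspark.sql import SparkSession
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

spark = SparkSession.builder.appName("rotation-forest-sketch").getOrCreate()
sc = spark.sparkContext

X = np.random.rand(10_000, 20)                 # placeholder data for the sketch
y = (X[:, 0] > 0.5).astype(int)
bX, by = sc.broadcast(X), sc.broadcast(y)      # ship the training data once to workers

def train_one(seed):
    """Runs on a worker: fit PCA on a bootstrap sample and train one tree."""
    rng = np.random.default_rng(seed)
    Xl, yl = bX.value, by.value
    boot = rng.integers(0, len(Xl), size=len(Xl))
    pca = PCA().fit(Xl[boot])                  # simplified stand-in for the rotation step
    tree = DecisionTreeClassifier(random_state=seed).fit(pca.transform(Xl[boot]), yl[boot])
    return pca, tree

# Each task trains one ensemble member; the fitted models are collected on the driver.
members = sc.parallelize(range(100), numSlices=100).map(train_one).collect()
```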

4.
Several studies have demonstrated the superior performance of ensemble classification algorithms, whereby multiple member classifiers are combined into one aggregated and powerful classification model, over single models. In this paper, two rotation-based ensemble classifiers are proposed as modeling techniques for customer churn prediction. In Rotation Forests, feature extraction is applied to feature subsets in order to rotate the input data for training base classifiers, while RotBoost combines Rotation Forest with AdaBoost. In an experimental validation based on data sets from four real-life customer churn prediction projects, Rotation Forest and RotBoost are compared to a set of well-known benchmark classifiers. Moreover, variations of Rotation Forest and RotBoost are compared, implementing three alternative feature extraction algorithms: principal component analysis (PCA), independent component analysis (ICA) and sparse random projections (SRP). The performance of the rotation-based ensemble classifiers is found to depend upon: (i) the performance criterion used to measure classification performance, and (ii) the implemented feature extraction algorithm. In terms of accuracy, RotBoost outperforms Rotation Forest, but none of the considered variations offers a clear advantage over the benchmark algorithms. However, in terms of AUC and top-decile lift, the results clearly demonstrate the competitive performance of Rotation Forests compared to the benchmark algorithms. Moreover, ICA-based Rotation Forests outperform all other considered classifiers and are therefore recommended as a well-suited alternative classification technique for the prediction of customer churn that allows for improved marketing decision making.
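The three feature-extraction variants map directly onto standard scikit-learn transformers. The sketch below is an assumption about tooling rather than the paper's code, and it is a simplification: one global transform precedes bagging, whereas rotation-based ensembles apply the extractor per feature subset and per tree. It only shows how the extractor can be swapped between PCA, ICA and sparse random projections.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FastICA
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import SparseRandomProjection
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

extractors = {
    "PCA": PCA(n_components=20, random_state=0),
    "ICA": FastICA(n_components=20, random_state=0),
    "SRP": SparseRandomProjection(n_components=20, random_state=0),
}

for name, extractor in extractors.items():
    # Transform the inputs, then bag decision trees on the transformed space.
    model = make_pipeline(extractor,
                          BaggingClassifier(DecisionTreeClassifier(),
                                            n_estimators=50, random_state=0))
    print(name, model.fit(X, y).score(X, y))
```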

5.
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. The compact representation of documents can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain challenged by a high-dimensional feature space; hence, extracting the most important and relevant words about the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure-based keyword extraction, term frequency–inverse sentence frequency-based keyword extraction, co-occurrence statistical-information-based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) with classification algorithms and ensemble methods for scientific text document classification (categorization). The study comprehensively compares base learning algorithms (Naïve Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under the curve. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that a Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained with the most-frequent-based keyword extraction method and a Bagging ensemble of Random Forest. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification.

6.
Churn prediction in telecom has recently gained substantial interest from stakeholders because of the associated revenue losses. Predicting telecom churners is a challenging problem due to the enormous size of telecom datasets. In this regard, we propose an intelligent churn prediction system for telecom that employs an efficient feature extraction technique and an ensemble method. We use Random Forest, Rotation Forest, RotBoost and DECORATE ensembles in combination with minimum redundancy and maximum relevance (mRMR), Fisher's ratio and F-score methods to model the telecom churn prediction problem. We observe that the mRMR method returns the most explanatory features compared to Fisher's ratio and F-score, which significantly reduces the computations and helps the ensembles attain improved performance. In comparison to Random Forest, Rotation Forest and DECORATE, RotBoost in combination with mRMR features attains better prediction performance on the standard telecom datasets. The better performance of the RotBoost ensemble is largely attributed to the rotation of the feature space, which enables the base classifiers to learn different aspects of the churners and non-churners. Moreover, the AdaBoost process within RotBoost also contributes to achieving higher prediction accuracy by handling hard instances. The performance evaluation is conducted on standard telecom datasets using AUC, sensitivity and specificity based measures. Simulation results reveal that the proposed approach based on RotBoost in combination with mRMR features (CP-MRB) is effective in handling the high dimensionality of the telecom datasets. CP-MRB offers higher accuracy in predicting churners and is thus quite promising for modeling the challenging problem of customer churn prediction in telecom.
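A greedy mRMR selector can be sketched in a few lines. The version below only illustrates the relevance-minus-redundancy idea using scikit-learn's mutual-information estimators; the function name and scoring details are assumptions, not the exact implementation used by the authors.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k, random_state=0):
    """Greedy minimum-redundancy-maximum-relevance feature selection (sketch)."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=random_state)
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < k:
        best_score, best_j = -np.inf, None
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: average mutual information with the already-selected features.
            redundancy = np.mean([
                mutual_info_regression(X[:, [j]], X[:, s],
                                       random_state=random_state)[0]
                for s in selected
            ])
            score = relevance[j] - redundancy
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected
```

The selected column indices can then feed any of the ensembles discussed in the abstract, e.g. training RotBoost on `X[:, mrmr_select(X, y, 30)]`.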

7.
Two new methods for tree ensemble construction are presented: G-Forest and GAR-Forest. In a similar way to Random Forest, the tree construction process entails a degree of randomness. The same strategy used in the GRASP metaheuristic for generating random and adaptive solutions is used at each node of the trees. The source of diversity of the ensemble is the randomness of the solution generation method of GRASP. A further key feature of the tree construction method for GAR-Forest is a decreasing level of randomness during the process of constructing the tree: maximum randomness at the root and minimum randomness at the leaves. The method is therefore named "GAR", GRASP with annealed randomness. The results conclusively demonstrate that G-Forest and GAR-Forest outperform Bagging, AdaBoost, MultiBoost, Random Forest and Random Subspaces. The results are even more convincing in the presence of noise, demonstrating the robustness of the method. The relationship between the accuracy of the base classifiers and their diversity is analysed by application of kappa-error diagrams and a variant of these called kappa-error relative movement diagrams.
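The key ingredient of GAR-Forest is the annealed, GRASP-style choice of the splitting attribute. The sketch below illustrates that idea only; the linear annealing schedule, threshold formula and function name are assumptions for illustration, not the authors' exact definitions. At shallow depths almost any attribute with a reasonable gain can be chosen from the restricted candidate list, while near the leaves the choice approaches the greedy best split.

```python
import numpy as np

def choose_split_attribute(gains, depth, max_depth, rng):
    """GRASP-style attribute choice with randomness annealed over tree depth.

    gains : impurity reduction of each candidate attribute at this node
    alpha : 1.0 at the root (fully random among candidates), 0.0 near max_depth
            (purely greedy) -- an assumed linear schedule.
    """
    gains = np.asarray(gains, dtype=float)
    alpha = max(0.0, 1.0 - depth / max_depth)
    g_max, g_min = gains.max(), gains.min()
    threshold = g_max - alpha * (g_max - g_min)
    # Restricted candidate list: attributes whose gain is "good enough".
    rcl = np.flatnonzero(gains >= threshold)
    return int(rng.choice(rcl))

rng = np.random.default_rng(0)
gains = [0.10, 0.02, 0.09, 0.00]
print(choose_split_attribute(gains, depth=0, max_depth=8, rng=rng))  # near-uniform choice
print(choose_split_attribute(gains, depth=7, max_depth=8, rng=rng))  # nearly greedy choice
```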

8.
Generalized additive models (GAMs) are a generalization of generalized linear models (GLMs) and constitute a powerful technique which has successfully proven its ability to capture nonlinear relationships between explanatory variables and a response variable in many domains. In this paper, GAMs are proposed as base classifiers for ensemble learning. Three alternative ensemble strategies for binary classification using GAMs as base classifiers are proposed: (i) GAMbag based on Bagging, (ii) GAMrsm based on the Random Subspace Method (RSM), and (iii) GAMens as a combination of both. In an experimental validation performed on 12 data sets from the UCI repository, the proposed algorithms are benchmarked against a single GAM and against decision tree based ensemble classifiers (i.e. RSM, Bagging, Random Forest, and the recently proposed Rotation Forest). From the results a number of conclusions can be drawn. Firstly, the use of an ensemble of GAMs instead of a single GAM always leads to improved prediction performance. Secondly, GAMrsm and GAMens perform comparably, while both versions outperform GAMbag. Finally, the value of using GAMs as base classifiers in an ensemble instead of standard decision trees is demonstrated: GAMbag demonstrates performance comparable to ordinary Bagging, while GAMrsm and GAMens outperform RSM and Bagging, and these two GAM ensemble variations perform comparably to Random Forest and Rotation Forest. Sensitivity analyses are included for the number of member classifiers in the ensemble, the number of variables included in a random feature subspace and the number of degrees of freedom for GAM spline estimation.
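Scikit-learn has no GAM class, so the sketch below stands in for a GAM with an additive spline basis plus logistic regression (SplineTransformer followed by LogisticRegression); this substitution and the member-construction loop are assumptions for illustration, not the GAMbag/GAMrsm/GAMens implementation. Each member is trained on a bootstrap sample (Bagging) of a random feature subspace (RSM), i.e., the GAMens-style combination of both strategies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

def fit_gamens(X, y, n_members=40, subspace=0.75, seed=0):
    """Bagging + random-subspace ensemble of additive spline models (GAM stand-ins)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    members = []
    for _ in range(n_members):
        feats = rng.choice(d, size=max(1, int(subspace * d)), replace=False)
        rows = rng.integers(0, n, size=n)                      # bootstrap sample
        gam = make_pipeline(SplineTransformer(n_knots=5, degree=3),
                            LogisticRegression(max_iter=1000))
        gam.fit(X[np.ix_(rows, feats)], y[rows])
        members.append((feats, gam))
    return members

def predict_proba_gamens(members, X):
    # Average the members' positive-class probabilities.
    probs = [gam.predict_proba(X[:, feats])[:, 1] for feats, gam in members]
    return np.mean(probs, axis=0)
```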

9.
An Ensemble Co-training Algorithm Based on Rotation Forest
Ensemble co-training is a semi-supervised learning approach that combines ensemble learning with the co-training algorithm, while Rotation Forest is an ensemble learning method that uses feature extraction to create diversity among base classifiers. Building on existing work on ensemble co-training, this paper proposes a co-training algorithm based on Rotation Forest, called ROFCO. The method focuses on using unlabeled data to improve both the diversity among the base classifiers and the quality of the feature extraction, so that the generalization error of the base classifiers stays the same or decreases while their diversity is maintained or even increased, thereby improving the ensemble. Experimental results show that the method performs well.

10.
To address the "curse of dimensionality" and the class-imbalance problem in web spam detection, a binary ensemble classifier algorithm based on immune clonal feature selection and under-sampling (US) is proposed. First, under-sampling is used to draw from the majority class several sample sets whose size is close to that of the minority class, and each is merged with the minority-class samples to form several balanced training subsets. Next, an immune clonal algorithm is designed to select several optimal feature subsets. The balanced subsets are then projected onto the optimal feature subsets, generating multiple views of the balanced data. Finally, Random Forest (RF) classifiers are used to classify the test samples, and simple majority voting determines the final class of each test sample. Experimental results on the WEBSPAM UK-2006 data set show that, when applied to web spam detection, the ensemble classifier improves accuracy, F1-measure and AUC by more than 11% compared with Random Forest and its Bagging and AdaBoost ensembles; compared with the best previously reported results, it improves the F1-measure by 2% and achieves the best AUC.
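A minimal sketch of the under-sampling-plus-voting part of this pipeline is shown below, assuming scikit-learn/NumPy and binary labels with 1 as the minority (spam) class. The immune clonal feature selection is outside the scope of the sketch and is replaced by a hypothetical `select_features` placeholder.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X_sub, y_sub, rng):
    """Placeholder for the immune clonal feature selection described in the paper:
    here it simply keeps a random half of the features."""
    d = X_sub.shape[1]
    return rng.choice(d, size=max(1, d // 2), replace=False)

def fit_undersampling_ensemble(X, y, n_members=10, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    members = []
    for _ in range(n_members):
        # Under-sample the majority class down to the minority-class size.
        maj_sample = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, maj_sample])
        feats = select_features(X[idx], y[idx], rng)
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[np.ix_(idx, feats)], y[idx])
        members.append((feats, clf))
    return members

def predict_vote(members, X):
    votes = np.array([clf.predict(X[:, feats]) for feats, clf in members])
    return (votes.mean(axis=0) >= 0.5).astype(int)   # simple majority vote
```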

11.
陈松峰, 范明. 《计算机科学》, 2010, 37(8): 236-239, 256
This paper proposes PCABoost, a new method for building combined classifiers from Bayesian base classifiers. To create a training set, the feature set is randomly partitioned into K subsets, PCA is applied to each subset to obtain its principal components, a new feature space is formed, and all training data are mapped into this new space to form a new training set. Different transformations yield different feature spaces and therefore several diverse training sets. On each new training set, AdaBoost is used to build a group of progressively boosted Bayesian classifiers (a classifier group); several diverse classifier groups are built in this way. Within each group a prediction is produced by weighted voting, and the predictions of the groups are then combined by voting to give the output of the combined classifier, yielding a two-level combined classifier. Experiments on 30 data sets randomly selected from the UCI repository show that the algorithm not only significantly improves the classification performance of Bayesian classifiers, but also achieves higher classification accuracy than combination methods such as Rotation Forest and AdaBoost on most of the data sets.
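The two-level structure described above can be sketched as follows. This is an interpretation for illustration under scikit-learn assumptions (function names are invented, integer class labels are assumed); the within-group weighted voting is delegated to AdaBoost's own weighted combination rather than re-implemented.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

def fit_pcaboost(X, y, n_groups=10, n_subsets=3, seed=0):
    """Sketch of a PCABoost-style two-level ensemble."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    groups = []
    for g in range(n_groups):
        # Randomly partition the features into K subsets and run PCA on each;
        # the concatenated components define a new feature space for this group.
        subsets = np.array_split(rng.permutation(d), n_subsets)
        pcas = [PCA().fit(X[:, cols]) for cols in subsets]
        X_new = np.hstack([p.transform(X[:, cols]) for p, cols in zip(pcas, subsets)])
        # One boosted group of naive Bayes classifiers per feature space
        # (AdaBoost already combines its members by weighted voting).
        booster = AdaBoostClassifier(GaussianNB(), n_estimators=25, random_state=g)
        booster.fit(X_new, y)
        groups.append((subsets, pcas, booster))
    return groups

def predict_pcaboost(groups, X):
    preds = []
    for subsets, pcas, booster in groups:
        X_new = np.hstack([p.transform(X[:, cols]) for p, cols in zip(pcas, subsets)])
        preds.append(booster.predict(X_new))
    preds = np.array(preds, dtype=int)
    # Outer level: plain majority vote across the classifier groups.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, preds)
```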

12.
This paper proposes a method for constructing ensembles of decision trees: random feature weights (RFW). The method is similar to Random Forest in that both introduce randomness into the construction of the decision trees. In Random Forest only a random subset of attributes is considered at each node, whereas RFW considers all of them. The source of randomness is a weight associated with each attribute. All the nodes in a tree use the same set of random weights, but a different set from the weights used in other trees. So the importance given to the attributes will be different in each tree, and that differentiates their construction. The method is compared to Bagging, Random Forest, Random Subspaces, AdaBoost and MultiBoost, obtaining favourable results for the proposed method, especially when using noisy data sets. RFW can be combined with these methods, and generally the combination of RFW with another method produces better results than the combined methods alone. Kappa-error diagrams and kappa-error movement diagrams are used to analyse the relationship between the accuracy of the base classifiers and their diversity.

13.
To improve the performance of ensemble learning, an ensemble method combining Rotation Forest and MultiBoost is proposed: the rotation transformation idea of Rotation Forest is used to transform the original data set, aiming to increase the diversity among classifiers, and MultiBoost is used to train base classifiers on the transformed data set, aiming to improve their accuracy. Finally, simple majority voting fuses the decisions of the base classifiers to form the output of the ensemble classifier. To verify the effectiveness of the method, experiments were conducted on public UCI data sets; the results show that the method achieves high classification accuracy.

14.
A classifier ensemble is a set of classifiers whose individual decisions are combined to classify new examples. Classifiers that can represent complex decision boundaries are accurate, and kernel functions can also represent complex decision boundaries. In this paper, we study the usefulness of kernel features for decision tree ensembles, as such features can improve the representational power of the individual classifiers. We first propose decision tree ensembles based on kernel features and find that the performance of these ensembles depends strongly on the kernel parameters, the selected kernel and the dimension of the kernel feature space. To overcome this problem, we present another approach to creating ensembles that combines existing ensemble methods with the kernel machine philosophy: kernel features are created and concatenated with the original features, and the classifiers of the ensemble are trained on these extended feature spaces. Experimental results suggest that the approach is quite robust to the selection of parameters. Experiments also show that different ensemble methods (Random Subspace, Bagging, AdaBoost.M1 and Random Forests) can be improved by using this approach.
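The "concatenate kernel features with the original features" idea maps onto standard tools: approximate kernel features (for example, a Nystroem RBF map in scikit-learn) are stacked next to the raw attributes and any ensemble is trained on the extended space. A minimal sketch under those tooling assumptions is given below; for simplicity the kernel map is fitted on all of X, whereas in practice it belongs inside a pipeline to avoid leakage across cross-validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.kernel_approximation import Nystroem
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Approximate RBF kernel features and append them to the original attributes.
kernel_map = Nystroem(kernel="rbf", gamma=0.1, n_components=50, random_state=0)
X_ext = np.hstack([X, kernel_map.fit_transform(X)])

for name, ens in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("Bagging", BaggingClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0))]:
    original = cross_val_score(ens, X, y, cv=5).mean()
    extended = cross_val_score(ens, X_ext, y, cv=5).mean()
    print(f"{name}: original={original:.3f}  extended={extended:.3f}")
```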

15.
Ensemble design techniques based on training-set resampling are successfully used to reduce the classification errors of the base classifiers. Boosting is one such technique, in which each training set is obtained by drawing samples with replacement from the available training set according to a weighted distribution that is modified for each new classifier to be included in the ensemble. The weighted resampling results in a set of classifiers, each accurate in different parts of the input space, as determined mainly by the sample weights. In this study, a dynamic integration of boosting-based ensembles is proposed so as to take the heterogeneity of the input sets into account. An evidence-theoretic framework is developed for this purpose that takes into account the weights and distances of the neighboring training samples when both training and testing boosting-based ensembles. The effectiveness of the proposed technique is compared to the AdaBoost algorithm using three different base classifiers.

16.
Many techniques have been proposed for credit risk assessment, from statistical models to artificial intelligence methods. During the last few years, different approaches to classifier ensembles have successfully been applied to credit scoring problems, proving to be generally more accurate than single prediction models. The present paper goes one step further by introducing composite ensembles that jointly use different strategies for diversity induction. Accordingly, the combination of data resampling algorithms (Bagging and AdaBoost) and attribute subset selection methods (random subspace and rotation forest) for the construction of composite ensembles is explored with the aim of improving prediction performance. The experimental results and statistical tests show that this new two-level classifier ensemble constitutes an appropriate solution for credit scoring problems, performing better than the traditional single ensembles and very significantly better than individual classifiers.
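One way to realize such a composite ensemble with off-the-shelf components is to wrap a boosting ensemble inside a bagging-plus-random-subspace outer layer. The sketch below is a generic illustration under scikit-learn assumptions, not the exact configurations studied in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

# Inner level: AdaBoost over decision stumps (diversity from data reweighting).
inner = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50,
                           random_state=0)
# Outer level: bagging of the boosted models, each trained on a random 50% feature
# subspace (diversity from attribute subset selection) -- a two-level composite ensemble.
composite = BaggingClassifier(inner, n_estimators=10, max_features=0.5,
                              bootstrap=True, bootstrap_features=False, random_state=0)

print("composite ensemble CV accuracy:", cross_val_score(composite, X, y, cv=5).mean())
```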

17.
An Ensemble Learning Method for Imbalanced Data Based on Sample-Weight Updating
Imbalanced data are common throughout big data and machine learning applications such as medical diagnosis and anomaly detection. Researchers have proposed or adopted many approaches to learning from imbalanced data, such as data sampling (e.g., SMOTE) and ensemble learning (e.g., EasyEnsemble). Over-sampling methods may suffer from overfitting or from low classification accuracy on boundary samples, while under-sampling methods may lead to underfitting. This paper fuses the basic ideas of SMOTE, Bagging and Boosting and proposes the Rotation SMOTE algorithm, which indirectly increases the weight of minority-class samples by applying SMOTE to them during the Boosting process according to the predictions of the base classifiers. Drawing on the basic idea of the Focal Loss, it further proposes FocalBoost, which directly optimizes the AdaBoost weight-update strategy according to the predictions of the base classifiers. Experiments on 11 imbalanced data sets from different application domains, evaluated with multiple metrics, show that compared with other imbalanced-data algorithms (including SMOTEBoost and EasyEnsemble), Rotation SMOTE achieves the highest recall on all data sets and the best or second-best G-mean and F1-score on most of them, while FocalBoost obtains better performance than the original AdaBoost on 9 of the data sets.
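The abstract does not give the exact FocalBoost update, so the sketch below only illustrates the general idea it describes: a focal-loss-style factor that concentrates an AdaBoost-like sample-weight update on the examples the current base classifier finds hard. All formulas, the function name and the binary 0/1 label assumption are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def focal_style_boost(X, y, n_rounds=20, gamma=2.0, seed=0):
    """Boosting loop whose re-weighting is modulated by (1 - p_true)**gamma."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    members = []
    for r in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=2, random_state=seed + r)
        stump.fit(X, y, sample_weight=w)
        proba = stump.predict_proba(X)
        p_true = proba[np.arange(n), y]            # probability assigned to the true class
        err = np.clip(np.average(1.0 - p_true, weights=w), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)    # AdaBoost-style member weight
        # Focal-style update: hard examples (small p_true) get up-weighted the most.
        w *= np.exp(alpha * (1.0 - p_true) ** gamma)
        w /= w.sum()
        members.append((alpha, stump))
    return members

def predict(members, X):
    score = sum(a * (m.predict_proba(X)[:, 1] - 0.5) for a, m in members)
    return (score > 0).astype(int)
```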

18.
This paper presents a novel ensemble classifier framework for improved classification of mammographic lesions in Computer-aided Detection (CADe) and Diagnosis (CADx) systems. Compared to previously developed classification techniques in mammography, the main novelty of the proposed method is twofold: (1) the combined use of different feature representations (of the same instance) and data resampling to generate more diverse and accurate base classifiers as ensemble members, and (2) the incorporation of a novel ensemble selection mechanism to further maximize the overall classification performance. In addition, as opposed to conventional ensemble learning, the proposed ensemble framework has the advantage of working well with both weak and strong classifiers, which are extensively used in mammography CADe and/or CADx systems. Extensive experiments have been performed using a benchmark mammogram dataset to test the proposed method on two classification applications: (1) false-positive (FP) reduction using classification between masses and normal tissues, and (2) diagnosis using classification between malignant and benign masses. The results are promising: the proposed method (area under the ROC curve (AUC) of 0.932 and 0.878 for the two classification applications, respectively) clearly outperforms the most commonly used single neural network (AUC = 0.819 and AUC = 0.754) and support vector machine (AUC = 0.849 and AUC = 0.773) based classification approaches. In addition, the feasibility of our method has been successfully demonstrated by comparison with other state-of-the-art ensemble classification techniques such as Gentle AdaBoost and the Random Forest learning algorithm.
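The "ensemble selection" step can be illustrated with the classic greedy forward-selection procedure: starting from a library of fitted base models, members are repeatedly added, with replacement, whenever they improve the validation AUC of the averaged probabilities. This is a generic sketch under scikit-learn assumptions, not the paper's specific mechanism.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def greedy_ensemble_selection(library, X_val, y_val, max_members=25):
    """library: list of fitted binary classifiers exposing predict_proba.
    Returns the indices of the selected members (selection with replacement)."""
    probs = [clf.predict_proba(X_val)[:, 1] for clf in library]
    selected, running_sum, best_auc = [], np.zeros(len(y_val)), -np.inf
    for _ in range(max_members):
        best_j = None
        for j, p in enumerate(probs):
            auc = roc_auc_score(y_val, (running_sum + p) / (len(selected) + 1))
            if auc > best_auc:
                best_auc, best_j = auc, j
        if best_j is None:          # no candidate improves the ensemble any further
            break
        selected.append(best_j)
        running_sum += probs[best_j]
    return selected, best_auc
```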

19.
Increasing the accuracy of thematic maps produced through image classification has been a hot topic in remote sensing. To this end, various strategies, classifiers, improvements, and their combinations have been suggested in the literature. Ensembles that combine the predictions of individual classifiers with weights based on the estimated prediction accuracies are one strategy for improving classifier performance. One of the recently introduced ensembles is the rotation forest, which is based on the idea of building accurate and diverse classifiers by applying feature extraction to the training sets and then reconstructing new training sets for each classifier. In this study, the effectiveness of the rotation forest was investigated for decision trees in land-use and land-cover (LULC) mapping, and its performance was compared with the performances of the six most widely used ensemble methods. The results confirmed the effectiveness of the rotation forest ensemble, as it produced the highest classification accuracies for the selected satellite data. When the statistical significance of the differences in performance was analysed using McNemar's tests based on the normal and chi-squared distributions, it was found that the rotation forest method outperformed the bagging, Diverse Ensemble Creation by Oppositional Relabelling of Artificial Training Examples (DECORATE), and random subspace methods, whereas the performance differences with the other ensembles were statistically insignificant.
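As a reminder of how this statistical comparison works, McNemar's test only needs the counts of samples on which the two classifiers disagree. The sketch below assumes the statsmodels package and uses a small made-up example; it is not the study's data.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(y_true, pred_a, pred_b):
    """Chi-squared McNemar test on the disagreement counts of two classifiers."""
    correct_a = (pred_a == y_true)
    correct_b = (pred_b == y_true)
    # 2x2 table: rows = classifier A correct/wrong, columns = classifier B correct/wrong.
    table = [[np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
             [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
    return mcnemar(table, exact=False, correction=True)

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0])
pred_a = np.array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1])
pred_b = np.array([0, 1, 0, 0, 0, 0, 1, 0, 0, 1])
print(mcnemar_compare(y_true, pred_a, pred_b))   # prints the statistic and p-value
```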

20.
We present an extensive empirical comparison between nineteen prototypical supervised ensemble learning algorithms, including Boosting, Bagging, Random Forests, Rotation Forests, Arc-X4, Class-Switching and their variants, as well as more recent techniques like Random Patches. These algorithms were compared against each other in terms of threshold, ranking/ordering and probability metrics over nineteen UCI benchmark data sets with binary labels. We also examine the influence of two base learners, CART and Extremely Randomized Trees, on the bias–variance decomposition and the effect of calibrating the models via Isotonic Regression on each performance metric. The selected data sets were already used in various empirical studies and cover different application domains. The source code and the detailed results of our study are publicly available.
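The calibration step mentioned above corresponds to fitting an isotonic (monotone) mapping from raw scores to probabilities on held-out predictions. With scikit-learn (a tooling assumption, not the authors' code) it wraps any of the ensembles discussed in one call, as in this sketch:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Isotonic calibration: a monotone mapping from scores to probabilities,
# fitted with internal cross-validation on the training split.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="isotonic", cv=5).fit(X_tr, y_tr)

print("Brier score, raw:       ", brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1]))
print("Brier score, calibrated:", brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1]))
```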
