Ensemble methods have been shown to be an effective tool for solving multi-label classification tasks. In the RAndom k-labELsets (RAKEL) algorithm, each member of the ensemble is associated with a small, randomly selected subset of k labels. Then, a single-label classifier is trained according to each combination of elements in the subset. In this paper we adopt a similar approach; however, instead of choosing subsets at random, we select the minimum number of k-label subsets required to cover all labels while meeting additional constraints, such as coverage of inter-label correlations. The cover is constructed by formulating subset selection as a minimum set covering problem (SCP) and solving it with approximation algorithms. Each cover needs to be prepared only once, by offline algorithms. Once prepared, a cover may be applied to the classification of any multi-label dataset whose properties conform to those of the cover. The contribution of this paper is twofold. First, we introduce SCP as a general framework for constructing label covers while allowing the user to incorporate cover construction constraints. We demonstrate the effectiveness of this framework by proposing two construction constraints whose enforcement produces covers that improve on the prediction performance of random selection by achieving better coverage of labels and inter-label correlations. Second, we provide theoretical bounds on the probability that random selection produces covers meeting the proposed construction criteria. The experimental results indicate that the proposed methods improve multi-label classification accuracy and stability compared to the RAKEL algorithm and to other state-of-the-art algorithms.
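
The cover construction described in this abstract can be illustrated with the textbook greedy approximation for set cover. This is only a sketch under stated assumptions: the abstract does not say which approximation algorithm the authors use, and the function name `greedy_label_cover` and its `order` parameter (1 covers every label, 2 covers every label pair as a stand-in for inter-label correlations) are hypothetical.

```python
from itertools import combinations


def greedy_label_cover(labels, k, order=1):
    """Greedily choose k-label subsets until every `order`-sized label
    combination is contained in at least one chosen subset.

    Hypothetical helper: the greedy rule is the standard set-cover
    approximation, not necessarily the algorithm used in the paper.
    """
    if not 1 <= order <= k <= len(labels):
        raise ValueError("need 1 <= order <= k <= number of labels")

    # Every possible k-labelset is a candidate member of the cover.
    candidates = [frozenset(c) for c in combinations(labels, k)]
    # Elements still to be covered: single labels (order=1) or pairs (order=2).
    uncovered = {frozenset(c) for c in combinations(labels, order)}

    cover = []
    while uncovered:
        # Pick the candidate that contains the most uncovered elements.
        best = max(candidates,
                   key=lambda s: sum(1 for e in uncovered if e <= s))
        cover.append(best)
        uncovered = {e for e in uncovered if not e <= best}
    return cover
```

Each returned subset would then be handed to one ensemble member, in place of the randomly drawn labelsets of RAKEL. Enumerating all k-subsets is only practical for small label sets, which fits the abstract's point that covers are prepared once, offline.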

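Concept lattices (Galois lattices) are an effective tool for learning data classification, but the large scale of lattice construction considerably impairs classification efficiency and accuracy. This work applies roughness theory to the concept lattice classification problem and proposes a new ensemble learning model for dynamic approximate concept lattice construction and classification mining (CACLR). Within the roughness interval, the model builds multiple relatively independent and fairly precise approximate concept lattice classifiers according to the distribution of the sample space. It promptly eliminates the many nodes that are irrelevant to classification knowledge during lattice construction and thus effectively reduces the size of the original lattice. The resulting fused ensemble learning model offers good rough classification accuracy and a good ability to learn and predict knowledge. Finally, comparative experiments with the CACLR ensemble learning model on standard UCI datasets confirm the model's practical value.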

高锋  黄海燕 《计算机科学》2017,44(8):225-229
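Imbalanced data severely degrades the performance of traditional classification algorithms and lowers the recognition rate of the minority class. This paper proposes a hybrid sampling technique based on neighborhood features: sampling weights are determined from the class distribution in each sample's neighborhood, and hybrid sampling is then used to obtain a balanced dataset. A dynamic ensemble method based on local confidence follows: base classifiers are generated through classification learning, and for each test sample the best base classifiers are dynamically selected and combined according to their local classification accuracy. Experiments on standard UCI datasets show that the method improves the classification accuracy of both the minority class and the majority class in imbalanced data.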

This paper proposes a complete framework for assessing the overall performance of classification models from a user perspective in terms of accuracy, comprehensibility, and justifiability. A review of accuracy and comprehensibility measures is provided, and a novel metric is introduced for measuring the justifiability of classification models. Furthermore, a taxonomy of domain constraints is introduced, and an overview is presented of existing approaches to imposing constraints and including domain knowledge in data mining techniques. Finally, the justifiability metric is applied to a credit scoring case and a customer churn prediction case.

Multirelational classification: a multiple view approach   Total citations: 1 (self-citations: 0, citations by others: 1)
Multirelational classification aims to discover useful patterns across multiple interconnected tables (relations) in a relational database. Many traditional learning techniques, however, assume a single table or a flat file as input (the so-called propositional algorithms). Existing multirelational classification approaches either “upgrade” mature propositional learning methods to deal with relational presentation or extensively “flatten” multiple tables into a single flat file, which is then handled by propositional algorithms. This article reports a multiple view strategy, in which neither “upgrading” nor “flattening” is required, for mining relational databases. Our approach learns from multiple views (feature sets) of a relational database and then integrates the information acquired by the individual view learners to construct a final model. Our empirical studies show that the method compares well with the classifiers induced by the majority of multirelational mining systems in terms of accuracy and running time. The paper explores the implications of this finding for multirelational research and applications. In addition, the method has practical significance: it is appropriate for directly mining many real-world databases.
Herna L. Viktor

A scalable, incremental learning algorithm for classification problems   Total citations: 5 (self-citations: 0, citations by others: 5)
In this paper a novel data mining algorithm, Clustering and Classification Algorithm-Supervised (CCA-S), is introduced. CCA-S enables the scalable, incremental learning of a non-hierarchical cluster structure from training data. This cluster structure serves as a function that maps the attribute values of new data to the target class of these data, that is, it classifies new data. CCA-S uses both the distance and the target class of training data points to derive the cluster structure. In this paper, we first present the problems that many existing data mining algorithms for classification, such as decision trees and artificial neural networks, have with scalable and incremental learning. We then describe CCA-S and discuss its advantages in scalable, incremental learning. The results of testing CCA-S on several common data sets for classification problems are presented. They show that the classification performance of CCA-S is comparable to that of other data mining algorithms such as decision trees, artificial neural networks, and discriminant analysis.

Data gravitation based classification (DGC) is a novel data classification technique based on the concept of data gravitation. The basic principle of the DGC algorithm is to classify data samples by comparing the data gravitation exerted by the different data classes. In the DGC model, a kind of “force” called data gravitation is computed between two data samples. Data from the same class are combined as a result of gravitation, and the gravitation from different data classes can then be compared: a sample is assigned to the class that exerts the larger gravitation on it. One outstanding advantage of DGC over other classification algorithms is that its classification principle is simple while its performance is high, which makes the algorithm much easier to implement. Feature selection plays an important role in classification problems, and a novel feature selection algorithm based on the idea of DGC and weighted features is investigated. The proposed method is validated on 12 well-known classification data sets from the UCI machine learning repository. Experimental results illustrate that the proposed method is very efficient for data classification and feature selection.
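
The classification principle in this abstract (assign a sample to the class whose gravitation on it is largest) can be sketched as follows. The abstract gives no formula, so the inverse-square law with unit masses used here is an assumption borrowed from Newtonian gravity, and the names `gravitation` and `classify_by_gravitation` are hypothetical.

```python
def gravitation(sample, points, eps=1e-12):
    """Total pull of a class on `sample`.

    Assumption: every training point has unit mass and pulls with
    strength 1 / squared Euclidean distance. The paper's actual
    gravitation formula is not given in the abstract.
    """
    total = 0.0
    for point in points:
        r2 = sum((a - b) ** 2 for a, b in zip(sample, point))
        total += 1.0 / (r2 + eps)  # eps avoids division by zero
    return total


def classify_by_gravitation(sample, classes):
    """Return the label of the class exerting the largest gravitation.

    `classes` maps each class label to a list of training points.
    """
    return max(classes, key=lambda label: gravitation(sample, classes[label]))
```

For example, with `classes = {"a": [(0, 0), (0, 1)], "b": [(5, 5), (6, 5)]}`, a sample near the origin is pulled most strongly by class `"a"`.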

Ensemble learning has attracted considerable attention owing to its good generalization performance. The main issues in constructing a powerful ensemble are training a set of diverse and accurate base classifiers and combining them effectively. The ensemble margin, computed as the difference between the number of votes received by the correct class and the number received by the other class with the most votes, is widely used to explain the success of ensemble learning. This definition of the ensemble margin does not consider the classification confidence of the base classifiers. In this work, we explore the influence of the classification confidence of the base classifiers in ensemble learning and obtain some interesting conclusions. First, we extend the definition of the ensemble margin on the basis of the classification confidence of the base classifiers. Then, an optimization objective is designed to compute the weights of the base classifiers by minimizing the margin-induced classification loss. Several strategies for using the classification confidences and the weights are tried. We observe that weighted voting based on classification confidence is better than simple voting if all the base classifiers are used. In addition, ensemble pruning can further improve the performance of a weighted voting ensemble. We also compare the proposed fusion technique with some classical algorithms. The experimental results show the effectiveness of weighted voting with classification confidence.
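
The vote-count margin defined in this abstract, and one plausible way to fold in classification confidence, can be sketched as below. With no confidences or weights the function reproduces the stated definition (votes for the correct class minus votes for the strongest other class). Summing `weight * confidence` per class is an assumption; the abstract does not give the authors' extended definition, and the function names are hypothetical.

```python
from collections import defaultdict


def class_scores(predictions, confidences=None, weights=None):
    """Accumulate a score per class from the base classifiers' votes.

    Each vote counts as weight * confidence; both default to 1.0,
    which reduces to plain vote counting.
    """
    n = len(predictions)
    confidences = confidences if confidences is not None else [1.0] * n
    weights = weights if weights is not None else [1.0] * n
    scores = defaultdict(float)
    for label, conf, weight in zip(predictions, confidences, weights):
        scores[label] += weight * conf
    return scores


def ensemble_margin(predictions, true_label, confidences=None, weights=None):
    """Score of the correct class minus the best score of any other class."""
    scores = class_scores(predictions, confidences, weights)
    correct = scores.get(true_label, 0.0)
    rival = max((s for label, s in scores.items() if label != true_label),
                default=0.0)
    return correct - rival


def weighted_vote(predictions, confidences=None, weights=None):
    """Return the class with the highest accumulated score."""
    scores = class_scores(predictions, confidences, weights)
    return max(scores, key=scores.get)
```

A positive margin means the ensemble classifies the sample correctly; with confidences, two lukewarm votes can be outweighed by one confident vote, which simple voting cannot express.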

Cancer diagnosis is an important emerging clinical application of microarray data. Accurate prediction of the type or size of tumors relies on powerful and reliable classification models, so that patients can be provided with better treatment or respond better to therapy. However, the high dimensionality of microarray data may cause problems for traditional classification models, such as over-fitting, poor performance, and low efficiency. Thus, one of the challenging tasks in cancer diagnosis is to identify, among the thousands of genes in microarray data, the salient expression genes that directly contribute to the phenotype or symptoms of a disease. In this paper, we propose a new ensemble gene selection method (EGS) that chooses multiple gene subsets for classification, where the significance of a gene is measured by conditional mutual information or its normalized form. Different gene subsets are obtained by setting different starting points for the search procedure; they are then used to train multiple base classifiers, which are aggregated into a consensus classifier by majority voting. The proposed method is compared with five popular gene selection methods on six public microarray datasets, and the comparison results show that our method works well.

Coronary artery disease (CAD) is a condition in which the heart is not fed sufficiently as a result of the accumulation of fatty matter. As reported by the World Health Organization, around 32% of all deaths in the world are caused by CAD, and it is estimated that approximately 23.6 million people will die from this disease in 2030. CAD develops over time, and it is difficult to diagnose until a blockage or a heart attack occurs. To avoid the side effects and high costs of the current methods, researchers have proposed diagnosing CAD with computer-aided systems, which analyze some physical and biochemical values at a lower cost. In this study, for CAD diagnosis, (i) seven different computational feature selection (FS) methods, one domain knowledge-based FS method, and different classification algorithms have been evaluated; (ii) an exhaustive ensemble FS method and a probabilistic ensemble FS method have been proposed. The proposed approach is tested on three publicly available CAD data sets using six different classification algorithms and four different variants of voting algorithms. The performance metrics have been comparatively evaluated with numerous combinations of classifiers and FS methods. The multi-layer perceptron classifier obtained satisfactory results on the three data sets. Performance evaluations show that the proposed approach achieved 91.78%, 85.55%, and 85.47% accuracy on the Z-Alizadeh Sani, Statlog, and Cleveland data sets, respectively.

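In dynamic data streams, instability and problems such as concept drift require an ensemble classification model to adapt promptly to new environments. At present, supervised information is usually used to update the weights of the base classifiers, so that base classifiers that fit the current environment receive higher weights; however, supervised information cannot be obtained immediately in real data stream environments. To solve this problem, this paper proposes a data stream ensemble classification algorithm that updates the weights of the base classifiers on the basis of information entropy. First, it uses ...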
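
The idea of weighting base classifiers by information entropy, without waiting for true labels, might look like the sketch below. The abstract is cut off before the update rule is described, so everything here is an assumption: each classifier's weight is taken as one minus the normalized mean Shannon entropy of the class-probability vectors it emits on the newest unlabeled chunk, so that classifiers that are more certain receive more weight. The function names are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in bits) of one class-probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def entropy_weights(chunk_outputs):
    """Derive one weight per base classifier from unlabeled data only.

    chunk_outputs[i] is the list of probability vectors (at least two
    classes each) that classifier i produced on the newest chunk.
    Assumed rule, not the paper's: weight = 1 - normalized mean
    entropy, then rescaled so the weights sum to 1.
    """
    raw = []
    for outputs in chunk_outputs:
        max_entropy = math.log2(len(outputs[0]))
        mean_entropy = sum(shannon_entropy(p) for p in outputs) / len(outputs)
        raw.append(1.0 - mean_entropy / max_entropy)

    total = sum(raw)
    if total == 0:
        # Every classifier is maximally uncertain: fall back to equal weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]
```

Low entropy only signals certainty, not correctness, so a scheme like this would normally be paired with some drift check once labels do arrive.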

In this paper, we make a comparative study of the effectiveness of ensemble techniques for sentiment classification. The ensemble framework is applied to sentiment classification tasks with the aim of efficiently integrating different feature sets and classification algorithms into a more accurate classification procedure. First, two types of feature sets are designed for sentiment classification, namely the part-of-speech based feature sets and the word-relation based feature sets. Second, three well-known text classification algorithms, namely naïve Bayes, maximum entropy, and support vector machines, are employed as base classifiers for each of the feature sets. Third, three types of ensemble methods, namely fixed combination, weighted combination, and meta-classifier combination, are evaluated for three ensemble strategies. A wide range of comparative experiments is conducted on five datasets that are widely used in sentiment classification. Finally, some in-depth discussion is presented and conclusions are drawn about the effectiveness of ensemble techniques for sentiment classification.

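To address the classification of imbalanced data, this paper proposes an improved SVM-KNN classification algorithm and, on that basis, designs an ensemble learning model. The model uses a bounded-count sampling method to partition the majority class samples, recombines each resulting majority class sub-cluster with the minority class samples, and trains the improved SVM-KNN on each combination to obtain multiple base classifiers, which are then combined. Experiments with the model on UCI datasets show that it performs well on imbalanced data classification.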
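
The data preparation step in this abstract (partition the majority class, then pair each part with the full minority class) can be sketched as below. The abstract does not say how the bounded-count sampling forms its sub-clusters, so the random split into chunks no larger than the minority class is an assumption, and `balanced_subsets` is a hypothetical name.

```python
import random


def balanced_subsets(majority, minority, seed=0):
    """Split the majority class into chunks and pair each with the minority.

    Assumption: chunks are formed by a random shuffle and capped at the
    minority class size, so every training subset is roughly balanced.
    One base classifier would be trained on each returned subset.
    """
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)

    size = max(1, len(minority))
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    # Every subset contains all minority samples plus one majority chunk.
    return [chunk + list(minority) for chunk in chunks]
```

With 10 majority samples and 3 minority samples this yields 4 training subsets, the last of which holds the single leftover majority sample.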

With the widespread use of social networks, forums, and blogs, customer reviews have emerged as a critical factor in customers' purchase decisions. Since the early 2000s, researchers have focused on these reviews in order to categorize them automatically into polarity levels such as positive, negative, and neutral. This research problem is known as sentiment classification. The objective of this study is to investigate the potential benefit of the multiple classifier systems concept for the Turkish sentiment classification problem and to propose a novel classification technique. The Vote algorithm has been used in conjunction with three classifiers, namely Naive Bayes, Support Vector Machine (SVM), and Bagging. The parameters of the SVM were optimized when it was used as an individual classifier. Experimental results showed that multiple classifier systems increase the performance of individual classifiers on Turkish sentiment classification datasets and that meta classifiers contribute to the power of these multiple classifier systems. The proposed approach achieved better performance than Naive Bayes, which was reported as the best individual classifier for these datasets, and than Support Vector Machines. Multiple classifier systems (MCS) are a good approach for sentiment classification, and parameter optimization of the individual classifiers must be taken into account when developing MCS-based prediction systems.

盛高斌  姚明海 《计算机仿真》2009,26(10):198-201,318
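To improve learner performance on problems with only a small number of labeled samples, this work combines semi-supervised learning with selective ensemble learning and proposes SSRES, a selective ensemble algorithm based on semi-supervised regression. Following the basic idea of semi-supervised learning, the algorithm trains learners on both labeled and unlabeled samples, which reduces the need for labeled samples. It then uses the selective ensemble algorithm GRES to choose appropriately among the learners and combines the selected ones to improve generalization ability. Experimental results show that the algorithm effectively improves learner performance on problems with few labeled samples.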

Ant colony optimization (ACO) algorithms, which aim at discovering a list of classification rules, have been successfully applied in data classification. However, because the search in ACO algorithms is essentially random, the lists of classification rules constructed by ACO-based classification algorithms are not fixed and may differ markedly even when the same training set is used. Those differences are generally ignored, and some beneficial information cannot be extracted from the different data sets, which may lower the predictive accuracy. To overcome this shortcoming, this paper proposes a novel ACO-based classification rule discovery algorithm, named AntMinermbc, in which a new model of multiple rule sets is presented to produce multiple lists of rules. Multiple base classifiers are built in AntMinermbc, and each base classifier is expected to remedy the weaknesses of the others, which can improve predictive accuracy by exploiting the useful information from the various base classifiers. A new heuristic function for ACO is also designed in our algorithm; it considers both correlation and coverage in order to avoid deceptively high accuracy. The performance of our algorithm is studied experimentally on 19 publicly available data sets and compared to several state-of-the-art classification approaches. The experimental results show that the predictive accuracy obtained by our algorithm is statistically higher than that of the compared methods.

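This paper proposes a multi-label algorithm based on association rules (MLAC). It uses a multi-label FP-tree to generate multi-label rules through decomposition and combination, and it makes classification predictions by combining multiple association rule classifiers. This lowers the high computational complexity caused by high-dimensional attributes and effectively improves the algorithm's performance and efficiency. Experimental results on multi-label datasets show that MLAC outperforms multi-label classification algorithms such as ML-KNN in both performance and efficiency.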

Credit scoring aims to assess the risk associated with lending to individual consumers. Recently, ensemble classification methodology has become popular in this field. However, most studies use random sampling to generate the training subsets from which the base classifiers are constructed. The diversity of the base classifiers is therefore not guaranteed, which may degrade overall classification performance. In this paper, we propose an ensemble classification approach based on supervised clustering for credit scoring. In the proposed approach, supervised clustering is employed to partition the data samples of each class into a number of clusters. Clusters from different classes are then combined pairwise to form a number of training subsets, and a specific base classifier is constructed on each training subset. For a sample whose class label needs to be predicted, the outputs of these base classifiers are combined by weighted voting. The weight associated with a base classifier is determined by its classification performance in the neighborhood of the sample. In the experimental study, two benchmark credit data sets are adopted for performance evaluation, and an industrial case study is conducted. The results show that, compared to other ensemble classification methods, the proposed approach generates base classifiers with higher diversity and local accuracy and improves the accuracy of credit scoring.
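
The weighted voting step in this abstract, where each base classifier's weight depends on how well it performs near the sample being predicted, can be sketched as follows. The details are assumptions: the neighborhood is taken to be the k nearest labeled validation samples under Euclidean distance, the weight is the classifier's accuracy on them, and `local_accuracy_vote` is a hypothetical name. Base classifiers are modeled as plain callables from a feature tuple to a label.

```python
import math
from collections import defaultdict


def local_accuracy_vote(x, classifiers, validation, k=3):
    """Predict the label of `x` by locally weighted voting.

    validation: non-empty list of (features, label) pairs with known labels.
    Assumption: a classifier's weight is its accuracy on the k validation
    samples closest to `x`; the paper's exact weighting is not given.
    """
    # The k labeled samples nearest to x define its neighborhood.
    neighbours = sorted(validation, key=lambda s: math.dist(x, s[0]))[:k]

    scores = defaultdict(float)
    for clf in classifiers:
        hits = sum(1 for features, label in neighbours if clf(features) == label)
        weight = hits / len(neighbours)
        scores[clf(x)] += weight  # the vote counts in proportion to local accuracy
    return max(scores, key=scores.get)
```

Under this scheme a single classifier that is reliable around `x` can outvote several classifiers that are wrong in that region, which is the behavior simple majority voting lacks.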

In many medical applications, data are taken from paired organs or from repeated measurements of the same organ or subject. Subject-based, as opposed to observation-based, evaluation of these data makes estimation of the misclassification rate more efficient. A subject-based approach to generating the bootstrap samples of bagging and bundling methods for classification is analyzed. A simulation model is used to compare the performance of different strategies for creating the bootstrap samples that are used to grow individual trees. The proposed approach is compared to linear discriminant analysis, logistic regression, random forests, and gradient boosting. Finally, the simulation results are applied to glaucoma diagnosis using both eyes of glaucoma patients and healthy controls. It is demonstrated that the proposed subject-based resampling reduces the misclassification rate.
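
Subject-based resampling, as described in this abstract, means drawing whole subjects rather than individual observations when building each bootstrap sample. A minimal sketch, assuming records of the form `(subject_id, features, label)`; the function name and record layout are hypothetical, and the paper's specific sampling strategies are not reproduced here.

```python
import random


def subject_bootstrap(records, rng):
    """Draw a bootstrap sample of subjects, keeping their records together.

    records: list of (subject_id, features, label) tuples, where one
    subject may contribute several records (for example both eyes).
    Returns (in_bag, out_of_bag); no subject appears in both.
    """
    by_subject = {}
    for record in records:
        by_subject.setdefault(record[0], []).append(record)

    subjects = list(by_subject)
    # Sample subjects with replacement, as many draws as there are subjects.
    drawn = [rng.choice(subjects) for _ in subjects]
    drawn_set = set(drawn)

    in_bag = [rec for subject in drawn for rec in by_subject[subject]]
    out_of_bag = [rec for subject in subjects if subject not in drawn_set
                  for rec in by_subject[subject]]
    return in_bag, out_of_bag
```

A call such as `subject_bootstrap(records, random.Random(1))` gives one in-bag set for growing a tree and an out-of-bag set for error estimation. Because the two eyes of a patient never end up on opposite sides of the split, the out-of-bag error is not flattered by correlated observations.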

To build a successful customer churn prediction model, a classification algorithm should be chosen that fulfills two requirements: strong classification performance and a high level of model interpretability. In recent literature, ensemble classifiers have demonstrated superior performance in a multitude of applications and data mining contests. However, because of their increased complexity, they result in models that are often difficult to interpret. In this study, GAMensPlus, an ensemble classifier based upon generalized additive models (GAMs) in which performance and interpretability are reconciled, is presented and evaluated in the context of churn prediction modeling. The recently proposed GAMens, based upon Bagging, the Random Subspace Method, and semi-parametric GAMs as constituent classifiers, is extended to include two instruments for model interpretability: generalized feature importance scores and bootstrap confidence bands for smoothing splines. In an experimental comparison on data sets from six real-life churn prediction projects, the competitive performance of the proposed algorithm relative to a set of well-known benchmark algorithms is demonstrated in terms of four evaluation metrics. Further, the ability of the technique to deliver valuable insight into the drivers of customer churn is illustrated in a case study on data from a European bank. First, it is shown how the generalized feature importance scores allow the analyst to identify the relative importance of churn predictors as a function of the criterion used to measure the quality of the model predictions. Second, the ability of GAMensPlus to identify nonlinear relationships between predictors and churn probabilities is demonstrated.
