期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Gene selection and prediction for cancer classification using support vector machines with a reject option

Hosik ChoiSunghoon Kwon Yongdai Kim 《Computational statistics & data analysis》2011,55(5):1897-1908

In cancer classification based on gene expression data, it would be desirable to defer a decision for observations that are difficult to classify. For instance, an observation for which the conditional probability of being cancer is around 1/2 would preferably require more advanced tests rather than an immediate decision. This motivates the use of a classifier with a reject option that reports a warning in cases of observations that are difficult to classify. In this paper, we consider a problem of gene selection with a reject option. Typically, gene expression data comprise of expression levels of several thousands of candidate genes. In such cases, an effective gene selection procedure is necessary to provide a better understanding of the underlying biological system that generates data and to improve prediction performance. We propose a machine learning approach in which we apply the l₁ penalty to the SVM with a reject option. This method is referred to as the l₁ SVM with a reject option. We develop a novel optimization algorithm for this SVM, which is sufficiently fast and stable to analyze gene expression data. The proposed algorithm realizes an entire solution path with respect to the regularization parameter. Results of numerical studies show that, in comparison with the standard l₁ SVM, the proposed method efficiently reduces prediction errors without hampering gene selectivity. 相似文献

2.

A flexible method for software effort estimation by analogy 总被引：2，自引：2，他引：0

Jingzhou Li Guenther Ruhe Ahmed Al-Emran Michael M. Richter 《Empirical Software Engineering》2007,12(1):65-106

Effort estimation by analogy uses information from former similar projects to predict the effort for a new project. Existing analogy-based methods are limited by their inability to handle non-quantitative data and missing values. The accuracy of predictions needs improvement as well. In this paper, we propose a new flexible method called AQUA that is able to overcome the limitations of former methods. AQUA combines ideas from two known analogy-based estimation techniques: case-based reasoning and collaborative filtering. The method is applicable to predict effort related to any object at the requirement, feature, or project levels. Which are the main contributions of AQUA when compared to other methods? First, AQUA supports non-quantitative data by defining similarity measures for different data types. Second, it is able to tolerate missing values. Third, the results from an explorative study in this paper shows that the prediction accuracy is sensitive to both the number N of analogies (similar objects) taken for adaptation and the threshold T for the degree of similarity, which is true especially for larger data sets. A fixed and small number of analogies, as assumed in existing analogy-based methods, may not produce the best accuracy of prediction. Fourth, a flexible mechanism based on learning of existing data is proposed for determining the appropriate values of N and T likely to offer the best accuracy of prediction. New criteria to measure the quality of prediction are proposed. AQUA was validated against two internal and one public domain data sets with non-quantitative attributes and missing values. The obtained results are encouraging. In addition, acomparative analysis with existing analogy-based estimation methods was conducted using three publicly available data sets that were used by these methods. Intwo of the three cases, AQUA outperformed all other methods. 相似文献

3.

Parkinson’s disease prediction using gene expression – A projection based learning meta-cognitive neural classifier approach

G. Sateesh Babu S. Suresh 《Expert systems with applications》2013,40(5):1519-1529

In this paper, we propose a gene expression based approach for the prediction of Parkinson’s disease (PD) using ‘projection based learning for meta-cognitive radial basis function network (PBL-McRBFN)’. McRBFN is inspired by human meta-cognitive learning principles. McRBFN has two components, a cognitive component and a meta-cognitive component. The cognitive component is a radial basis function network with evolving architecture. In the cognitive component, the PBL algorithm computes the optimal output weights with least computational effort. The meta-cognitive component controls the learning process in the cognitive component by choosing the best learning strategy for the current sample and adapts the learning strategies by implementing self-regulation. The interaction of cognitive component and meta-cognitive component address the what-to-learn, when-to-learn and how-to-learn of human learning principles efficiently.PBL-McRBFN classifier is used to predict PD using micro-array gene expression data obtained from ParkDB database. The performance of PBL-McRBFN classifier has been evaluated using Independent Component Analysis (ICA) reduced features sets from the complete genes and selected genes with two different significance levels. Further, the performance of PBL-McRBFN classifier is statistically compared with existing classifiers using one-way repeated ANOVA test. Further, it is also used in PD prediction using the standard vocal and gait PD data sets. In all these data sets, the performance of PBL-McRBFN is compared against existing results in the literature. Performance results clearly highlight the superior performance of our proposed approach. 相似文献

4.

基于文件粒度的多目标软件缺陷预测方法实证研究

陈翔赵英全顾庆倪超王赞《软件学报》2019,30(12):3694-3713

软件缺陷预测技术通过挖掘和分析软件库训练出软件缺陷预测模型,随后利用该模型来预测出被测软件项目内的缺陷程序模块,因此可以有效地优化测试资源的分配.在基于代价感知的评测指标下,有监督学习方法与无监督学习方法之间的预测性能比较是最近的一个热门研究话题.其中在基于文件粒度的缺陷预测问题中,Yan等人最近对Yang等人考虑的无监督学习方法和有监督学习方法展开了大规模实证研究,结果表明存在一些无监督学习方法,其性能要优于有监督方法.基于来自开源社区的10个项目展开了实证研究.结果表明：在同项目缺陷预测场景中,若基于ACC评测指标,MULTI方法与最好的无监督方法和有监督方法相比,其预测性能平均有105.81%和123.84%的提高;若基于P_OPT评测指标,MULTI方法与最好的无监督方法和有监督方法相比,其预测性能平均有35.61%和38.70%的提高.在跨项目缺陷预测场景中,若基于ACC评测指标,MULTI方法与最好的无监督方法和有监督方法相比,其预测性能平均有22.42%和34.95%的提高.若基于P_OPT评测指标,MULTI方法与最好的无监督方法和有监督方法相比,其预测性能平均有11.45%和17.92%的提高.同时,基于Huang等人提出的PMI和IFA评测指标,MULTI方法的表现与代价感知的指标相比存在一定的折衷问题,但仍好于在ACC和P_OPT评测指标下表现最好的两种无监督学习方法.除此之外,将MULTI方法与最新提出的OneWay和CBS方法进行了比较,结果表明,MULTI方法在性能上仍然可以显著优于这两种方法.同时,基于F1评测指标的结果也验证了MULTI方法在预测性能上的显著优越性.最后,通过分析模型构建的时间开销,表明MULTI方法的模型构建开销对开发人员来说处于可接受的范围之内. 相似文献

5.

Combined classifier for cross-project defect prediction: an extended empirical study

Yun ZHANG David LO Xin XIA Jianling SUN 《Frontiers of Computer Science》2018,12(2):280-296

To facilitate developers in effective allocation of their testing and debugging efforts, many software defect prediction techniques have been proposed in the literature. These techniques can be used to predict classes that are more likely to be buggy based on the past history of classes, methods, or certain other code elements. These techniques are effective provided that a sufficient amount of data is available to train a prediction model. However, sufficient training data are rarely available for new software projects. To resolve this problem, cross-project defect prediction, which transfers a prediction model trained using data from one project to another, was proposed and is regarded as a new challenge in the area of defect prediction. Thus far, only a few cross-project defect prediction techniques have been proposed. To advance the state of the art, in this study, we investigated seven composite algorithms that integrate multiple machine learning classifiers to improve cross-project defect prediction. To evaluate the performance of the composite algorithms, we performed experiments on 10 open-source software systems from the PROMISE repository, which contain a total of 5,305 instances labeled as defective or clean. We compared the composite algorithms with the combined defect predictor where logistic regression is used as the meta classification algorithm (CODEP_Logistic), which is the most recent cross-project defect prediction algorithm in terms of two standard evaluation metrics: cost effectiveness and F-measure. Our experimental results show that several algorithms outperform CODEP_Logistic: Maximum voting shows the best performance in terms of F-measure and its average F-measure is superior to that of CODEP_Logistic by 36.88%. Bootstrap aggregation (Bagging_J48) shows the best performance in terms of cost effectiveness and its average cost effectiveness is superior to that of CODEP_Logistic by 15.34%. 相似文献

6.

Cascade of genetic algorithm and decision tree for cancer classification on gene expression data

Jinn‐Yi Yeh Tai‐Hsi Wu 《Expert Systems》2010,27(3):201-218

Abstract: Cancer classification, through gene expression data analysis, has produced remarkable results, and has indicated that gene expression assays could significantly aid in the development of efficient cancer diagnosis and classification platforms. However, cancer classification, based on DNA array data, remains a difficult problem. The main challenge is the overwhelming number of genes relative to the number of training samples, which implies that there are a large number of irrelevant genes to be dealt with. Another challenge is from the presence of noise inherent in the data set. It makes accurate classification of data more difficult when the sample size is small. We apply genetic algorithms (GAs) with an initial solution provided by t statistics, called t‐GA, for selecting a group of relevant genes from cancer microarray data. The decision‐tree‐based cancer classifier is built on the basis of these selected genes. The performance of this approach is evaluated by comparing it to other gene selection methods using publicly available gene expression data sets. Experimental results indicate that t‐GA has the best performance among the different gene selection methods. The Z‐score figure also shows that some genes are consistently preferentially chosen by t‐GA in each data set. 相似文献

7.

Simultaneous cancer classification and gene selection with Bayesian nearest neighbor method: An integrated approach

Sounak Chakraborty 《Computational statistics & data analysis》2009,53(4):1462-1474

Since most cancer treatments come with a certain degree of toxicity it is very essential to identify a cancer type correctly and then administer the relevant therapy. With the arrival of powerful tools such as gene expression microarrays the cancer classification basis is slowly changing from morphological properties to molecular signatures. Several recent studies have demonstrated a marked improvement in prediction accuracy of tumor types based on gene expression microarray measurements over clinical markers. The main challenge in working with gene expression microarrays is that there is a huge number of genes to work with. Out of them only a small fraction are actually relevant for differentiating between different types of cancer. A Bayesian nearest neighbor model equipped with an integrated variable selection technique is proposed to overcome this challenge. This classification and gene selection model is able to classify different cancer types accurately and simultaneously identify the relevant or important genes. The proposed model is completely automatic in the sense that it adaptively picks up the neighborhood size and the important covariates. The method is successfully applied to three simulated data sets and four well known real data sets. To demonstrate the competitiveness of the method a comparative study is also done with several other “off the shelf” popular classification methods. For all the simulated data sets and real life data sets, the proposed method produced highly competitive if not better results. While the standard approach is two step model building for gene selection and then tumor prediction, this novel adaptive gene selection technique automatically selects the relevant genes along with tumor class prediction in one go. The biological relevance of the selected genes are also discussed to validate the claim. 相似文献

8.

Balancing type one and two errors in multiple testing for differential expression of genes

Alexander Gordon Galina Glazko 《Computational statistics & data analysis》2009,53(5):1622-1629

A new procedure is proposed to balance type I and II errors in significance testing for differential expression of individual genes. Suppose that a collection, F_k, of k lists of selected genes is available, each of them approximating by their content the true set of differentially expressed genes. For example, such sets can be generated by a subsampling counterpart of the delete-d-jackknife method controlling the per-comparison error rate for each subsample. A final list of candidate genes, denoted by S^∗, is composed in such a way that its contents be closest in some sense to all the sets thus generated. To measure “closeness” of gene lists, we introduce an asymmetric distance between sets with its asymmetry arising from a generally unequal assignment of the relative costs of type I and type II errors committed in the course of gene selection. The optimal set S^∗ is defined as a minimizer of the average asymmetric distance from an arbitrary set S to all sets in the collection F_k. The minimization problem can be solved explicitly, leading to a frequency criterion for the inclusion of each gene in the final set. The proposed method is tested by resampling from real microarray gene expression data with artificially introduced shifts in expression levels of pre-defined genes, thereby mimicking their differential expression. 相似文献

9.

Developing a continental-scale measure of gross primary production by combining MODIS and AmeriFlux data through Support Vector Machine approach

Feihua Yang Kazuhito Ichii Michael A. White Andrew R. Michaelis Petr Votava A-Xing Zhu Alfredo Huete Ramakrishna R. Nemani 《Remote sensing of environment》2007,110(1):109-122

Remote sensing is a potentially powerful technology with which to extrapolate eddy covariance-based gross primary production (GPP) to continental scales. In support of this concept, we used meteorological and flux data from the AmeriFlux network and Support Vector Machine (SVM), an inductive machine learning technique, to develop and apply a predictive GPP model for the conterminous U.S. In the following four-step process, we first trained the SVM to predict flux-based GPP from 33 AmeriFlux sites between 2000 and 2003 using three remotely-sensed variables (land surface temperature, enhanced vegetation index (EVI), and land cover) and one ground-measured variable (incident shortwave radiation). Second, we evaluated model performance by predicting GPP for 24 available AmeriFlux sites in 2004. In this independent evaluation, the SVM predicted GPP with a root mean squared error (RMSE) of 1.87 gC/m²/day and an R² of 0.71. Based on annual total GPP at 15 AmeriFlux sites for which the number of 8-day averages in 2004 was no less than 67% (30 out of a possible 45), annual SVM GPP prediction error was 32.1% for non-forest ecosystems and 22.2% for forest ecosystems, while the standard Moderate Resolution Imaging Spectroradiometer GPP product (MOD17) had an error of 50.3% for non-forest ecosystems and 21.5% for forest ecosystems, suggesting that the regionally tuned SVM performed better than the standard global MOD17 GPP for non-forest ecosystems but had similar performance for forest ecosystems. The most important explanatory factor for GPP prediction was EVI, removal of which increased GPP RMSE by 0.85 gC/m²/day in a cross-validation experiment. Third, using the SVM driven by remote sensing data including incident shortwave radiation, we predicted 2004 conterminous U.S. GPP and found that results were consistent with expected spatial and temporal patterns. Finally, as an illustration of SVM GPP for ecological applications, we estimated maximum light use efficiency (e_max), one of the most important factors for standard light use efficiency models, for the conterminous U.S. by integrating the 2004 SVM GPP with the MOD17 GPP algorithm. We found that e_max varied from ∼ 0.86 gC/MJ in grasslands to ∼ 1.56 gC/MJ in deciduous forests, while MOD17 e_max was 0.68 gC/MJ for grasslands and 1.16 gC/MJ for deciduous forests, suggesting that refinements of MOD17 e_max may be beneficial. 相似文献

10.

Optimization of Robust Loss Functions for Weakly-Labeled Image Taxonomies

Julian J. McAuley Arnau Ramisa Tibério S. Caetano 《International Journal of Computer Vision》2013,104(3):343-361

The recently proposed ImageNet dataset consists of several million images, each annotated with a single object category. These annotations may be imperfect, in the sense that many images contain multiple objects belonging to the label vocabulary. In other words, we have a multi-label problem but the annotations include only a single label (which is not necessarily the most prominent). Such a setting motivates the use of a robust evaluation measure, which allows for a limited number of labels to be predicted and, so long as one of the predicted labels is correct, the overall prediction should be considered correct. This is indeed the type of evaluation measure used to assess algorithm performance in a recent competition on ImageNet data. Optimizing such types of performance measures presents several hurdles even with existing structured output learning methods. Indeed, many of the current state-of-the-art methods optimize the prediction of only a single output label, ignoring this ‘structure’ altogether. In this paper, we show how to directly optimize continuous surrogates of such performance measures using structured output learning techniques with latent variables. We use the output of existing binary classifiers as input features in a new learning stage which optimizes the structured loss corresponding to the robust performance measure. We present empirical evidence that this allows us to ‘boost’ the performance of binary classification on a variety of weakly-supervised labeling problems defined on image taxonomies. 相似文献

11.

Toward breast cancer survivability prediction models through improving training space

Jaree Thongkam Guandong Xu Yanchun Zhang Fuchun Huang 《Expert systems with applications》2009,36(10):12200-12209

Due to the difficulties of outlier and skewed data, the prediction of breast cancer survivability has presented many challenges in the field of data mining and pattern precognition, especially in medical research. To solve these problems, we have proposed a hybrid approach to generating higher quality data sets in the creation of improved breast cancer survival prediction models. This approach comprises two main steps: (1) utilization of an outlier filtering approach based on C-Support Vector Classification (C-SVC) to identify and eliminate outlier instances; and (2) application of an over-sampling approach using over-sampling with replacement to increase the number of instances in the minority class. In order to assess the capability and effectiveness of the proposed approach, several measurement methods including basic performance (e.g., accuracy, sensitivity, and specificity), Area Under the receiver operating characteristic Curve (AUC) and F-measure were utilized. Moreover, a 10-fold cross-validation method was used to reduce the bias and variance of the results of breast cancer survivability prediction models. Results have indicated that the proposed approach leads to improving the performance of breast cancer survivability prediction models by up to 28.34% due to the improved training data space. 相似文献

12.

Cancer classification using gene expression data

《Information Systems》2003,28(4):243-268

The classification of different tumor types is of great importance in cancer diagnosis and drug discovery. However, most previous cancer classification studies are clinical based and have limited diagnostic ability. Cancer classification using gene expression data is known to contain the keys for addressing the fundamental problems relating to cancer diagnosis and drug discovery. The recent advent of DNA microarray technique has made simultaneous monitoring of thousands of gene expressions possible. With this abundance of gene expression data, researchers have started to explore the possibilities of cancer classification using gene expression data. Quite a number of methods have been proposed in recent years with promising results. But there are still a lot of issues which need to be addressed and understood.In order to gain a deep insight into the cancer classification problem, it is necessary to take a closer look at the problem, the proposed solutions and the related issues all together. In this survey paper, we present a comprehensive overview of various proposed cancer classification methods and evaluate them based on their computation time, classification accuracy and ability to reveal biologically meaningful gene information. We also introduce and evaluate various proposed gene selection methods which we believe should be an integral preprocessing step for cancer classification. In order to obtain a full picture of cancer classification, we also discuss several issues related to cancer classification, including the biological significance vs. statistical significance of a cancer classifier, the asymmetrical classification errors for cancer classifiers, and the gene contamination problem. 相似文献

13.

Using OVA modeling to improve classification performance for large datasets

Patricia E.N. Lutu Andries P. Engelbrecht 《Expert systems with applications》2012,39(4):4358-4376

One-Versus-All (OVA) classification is a classifier construction method where a k-class prediction task is decomposed into k 2-class sub-problems. One base model is constructed for each sub-problem and the base models are then combined into one model. Aggregate model implementation is the process of constructing several base models which are then combined into a single model for prediction. In essence, OVA classification is a method of aggregate modeling. This paper reports studies that were conducted to establish whether OVA classification can provide predictive performance gains when large volumes of data are available for modeling as is commonly the case in data mining. It is demonstrated in this paper that firstly, OVA modeling can be used to increase the amount of training data while at the same time using base model training sets whose size is much smaller than the total amount of available training data. Secondly, OVA models created from large datasets provide a higher level of predictive performance compared to single k-class models. Thirdly, the use of boosted OVA base models can provide higher predictive performance compared to un-boosted OVA base models. Fourthly, when the combination algorithm for base model predictions is able to resolve tied predictions, the resulting aggregate models provide a higher level of predictive performance. 相似文献

14.

Measuring firm performance using financial ratios: A decision tree approach

Dursun Delen Cemil Kuzey Ali Uyar 《Expert systems with applications》2013,40(10):3970-3983

Determining the firm performance using a set of financial measures/ratios has been an interesting and challenging problem for many researchers and practitioners. Identification of factors (i.e., financial measures/ratios) that can accurately predict the firm performance is of great interest to any decision maker. In this study, we employed a two-step analysis methodology: first, using exploratory factor analysis (EFA) we identified (and validated) underlying dimensions of the financial ratios, followed by using predictive modeling methods to discover the potential relationships between the firm performance and financial ratios. Four popular decision tree algorithms (CHAID, C5.0, QUEST and C&RT) were used to investigate the impact of financial ratios on firm performance. After developing prediction models, information fusion-based sensitivity analyses were performed to measure the relative importance of independent variables. The results showed the CHAID and C5.0 decision tree algorithms produced the best prediction accuracy. Sensitivity analysis results indicated that Earnings Before Tax-to-Equity Ratio and Net Profit Margin are the two most important variables. 相似文献

15.

Quantitative analysis of salt-affected soil reflectance spectra: A comparison of two adaptive methods (PLSR and ANN) 总被引：15，自引：0，他引：15

J. Farifteh F. Van der Meer E.J.M. Carranza 《Remote sensing of environment》2007,110(1):59-78

In this paper the possibility of predicting salt concentrations in soils from measured reflectance spectra is studied using partial least squares regression (PLSR) and artificial neural network (ANN). Performance of these two adaptive methods has been compared in order to examine linear and non-linear relationship between soil reflectance and salt concentration.Experiment-, field- and image-scale data sets were prepared consisting of soil EC measurements (dependent variable) and their corresponding reflectance spectra (independent variables). For each data set, PLSR and ANN predictive models of soil salinity were developed based on soil reflectance data. The predictive accuracies of PLSR and ANN models were assessed against independent validation data sets not included in the calibration or training phase.The results of PLSR analyses suggest that an accurate to good prediction of EC can be made based on models developed from experiment-scale data (R² > 0.81 and RPD (ratio of prediction to deviation) > 2.1) for soil samples salinized by bischofite and epsomite minerals. For field-scale data sets, the PLSR predictive models provided approximate quantitative EC estimations (R² = 0.8 and RPD = 2.2) for grids 1 and 6 and poor estimations for grids 2, 3, 4 and 5. The salinity predictions from image-scale data sets by PLSR models were very reliable to good (R² between 0.86 and 0.94 and RPD values between 2.6 and 4.1) except for sub-image 2 (R² = 0.61 and RPD = 1.2).The ANN models from experiment-scale data set revealed similar network performances for training, validation and test data sets indicating a good network generalization for samples salinized by bischofite and epsomite minerals. The RPD and the R² between reference measurements and ANN outputs of theses models suggest an accurate to good prediction of soil salinity (R² > 0.92 and RPD > 2.3). For the field-scale data set, prediction accuracy is relatively poor (0.69 > R² > 0.42). The ANN predictive models estimating soil salinity from image-scale data sets indicate a good prediction (R² > 0.86 and RPD > 2.5) except for sub-image 2 (R² = 0.6 and RPD = 1.2).The results of this study show that both methods have a great potential for estimating and mapping soil salinity. Performance indexes from both methods suggest large similarity between the two approaches with PLSR advantages. This indicates that the relation between soil salinity and soil reflectance can be approximated by a linear function. 相似文献

16.

Two-level and hybrid ensembles of decision trees for high performance concrete compressive strength prediction

Halil Ibrahim Erdal 《Engineering Applications of Artificial Intelligence》2013,26(7):1689-1697

Accurate prediction of high performance concrete (HPC) compressive strength is very important issue. In the last decade, a variety of modeling approaches have been developed and applied to predict HPC compressive strength from a wide range of variables, with varying success. The selection, application and comparison of decent modeling methods remain therefore a crucial task, subject to ongoing researches and debates. This study proposes three different ensemble approaches: (i) single ensembles of decision trees (DT) (ii) two-level ensemble approach which employs same ensemble learning method twice in building ensemble models (iii) hybrid ensemble approach which is an integration of attribute-base ensemble method (random sub-spaces RS) and instance-base ensemble methods (bagging Bag, stochastic gradient boosting GB). A decision tree is used as the base learner of ensembles and its results are benchmarked to proposed ensemble models. The obtained results show that the proposed ensemble models could noticeably advance the prediction accuracy of the single DT model and for determining average determination of correlation, the best models for HPC compressive strength forecasting are GB–RS DT, RS–GB DT and GB–GB DT among the eleven proposed predictive models, respectively. The obtained results show that the proposed ensemble models could noticeably advance the prediction accuracy of the single DT model and for determining determination of correlation (R²_max), the best models for HPC compressive strength forecasting are GB–RS DT (R²=0.9520), GB–GB DT (R²=0.9456) and Bag–Bag DT (R²=0.9368) among the eleven proposed predictive models, respectively. 相似文献

17.

Y. Arzhaeva Author Vitae D.M.J. Tax^{Author Vitae} 《Pattern recognition》2009,42(9):1768-1776

In this paper classification on dissimilarity representations is applied to medical imaging data with the task of discrimination between normal images and images with signs of disease. We show that dissimilarity-based classification is a beneficial approach in dealing with weakly labeled data, i.e. when the location of disease in an image is unknown and therefore local feature-based classifiers cannot be trained. A modification to the standard dissimilarity-based approach is proposed that makes a dissimilarity measure multi-valued, hence, able to retain more information. A multi-valued dissimilarity between an image and a prototype becomes an image representation vector in classification. Several classification outputs with respect to different prototypes are further integrated into a final image decision. Both standard and proposed methods are evaluated on data sets of chest radiographs with textural abnormalities and compared to several feature-based region classification approaches applied to the same data. On a tuberculosis data set the multi-valued dissimilarity-based classification performs as well as the best region classification method applied to the fully labeled data, with an area under the receiver operating characteristic (ROC) curve (A_z) of 0.82. The standard dissimilarity-based classification yields A_z=0.80. On a data set with interstitial abnormalities both dissimilarity-based approaches achieve A_z=0.98 which is closely behind the best region classification method. 相似文献

18.

Empirical characterization of random forest variable importance measures 总被引：2，自引：0，他引：2

Kellie J. Archer Ryan V. Kimes 《Computational statistics & data analysis》2008,52(4):2249-2260

Microarray studies yield data sets consisting of a large number of candidate predictors (genes) on a small number of observations (samples). When interest lies in predicting phenotypic class using gene expression data, often the goals are both to produce an accurate classifier and to uncover the predictive structure of the problem. Most machine learning methods, such as k-nearest neighbors, support vector machines, and neural networks, are useful for classification. However, these methods provide no insight regarding the covariates that best contribute to the predictive structure. Other methods, such as linear discriminant analysis, require the predictor space be substantially reduced prior to deriving the classifier. A recently developed method, random forests (RF), does not require reduction of the predictor space prior to classification. Additionally, RF yield variable importance measures for each candidate predictor. This study examined the effectiveness of RF variable importance measures in identifying the true predictor among a large number of candidate predictors. An extensive simulation study was conducted using 20 levels of correlation among the predictor variables and 7 levels of association between the true predictor and the dichotomous response. We conclude that the RF methodology is attractive for use in classification problems when the goals of the study are to produce an accurate classifier and to provide insight regarding the discriminative ability of individual predictor variables. Such goals are common among microarray studies, and therefore application of the RF methodology for the purpose of obtaining variable importance measures is demonstrated on a microarray data set. 相似文献

19.

Towards an ensemble based system for predicting the number of software faults

《Expert systems with applications》2017

Software fault prediction using different techniques has been done by various researchers previously. It is observed that the performance of these techniques varied from dataset to dataset, which make them inconsistent for fault prediction in the unknown software project. On the other hand, use of ensemble method for software fault prediction can be very effective, as it takes the advantage of different techniques for the given dataset to come up with better prediction results compared to individual technique. Many works are available on binary class software fault prediction (faulty or non-faulty prediction) using ensemble methods, but the use of ensemble methods for the prediction of number of faults has not been explored so far. The objective of this work is to present a system using the ensemble of various learning techniques for predicting the number of faults in given software modules. We present a heterogeneous ensemble method for the prediction of number of faults and use a linear combination rule and a non-linear combination rule based approaches for the ensemble. The study is designed and conducted for different software fault datasets accumulated from the publicly available data repositories. The results indicate that the presented system predicted number of faults with higher accuracy. The results are consistent across all the datasets. We also use prediction at level l (Pred(l)), and measure of completeness to evaluate the results. Pred(l) shows the number of modules in a dataset for which average relative error value is less than or equal to a threshold value l. The results of prediction at level l analysis and measure of completeness analysis have also confirmed the effectiveness of the presented system for the prediction of number of faults. Compared to the single fault prediction technique, ensemble methods produced improved performance for the prediction of number of software faults. Main impact of this work is to allow better utilization of testing resources helping in early and quick identification of most of the faults in the software system. 相似文献

20.

Quantum clustering-based weighted linear programming support vector regression for multivariable nonlinear problem

Yanfang Yu Feng Qian Huimin Liu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2010,14(9):921-929

Linear programming support vector regression shows improved reliability and generates sparse solution, compared with standard support vector regression. We present the v-linear programming support vector regression approach based on quantum clustering and weighted strategy to solve the multivariable nonlinear regression problem. First, the method applied quantum clustering to variable selection, introduced inertia weight, and took prediction precision of v-linear programming support vector regression as evaluation criteria, which effectively removed redundancy feature attributes and also reduced prediction error and support vectors. Second, it proposed a new weighted strategy due to each data point having different influence on regression model and determined the weighted parameter p in terms of distribution of training error, which greatly improved the generalization approximate ability. Experimental results demonstrated that the proposed algorithm enabled the mean squared error of test sets of Boston housing, Bodyfat, Santa dataset to, respectively, decrease by 23.18, 78.52, and 41.39%, and also made support vectors degrade rapidly, relative to the original v-linear programming support vector regression method. In contrast with other methods exhibited in the relevant literatures, the present algorithm achieved better generalization performance. 相似文献