首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Since most cancer treatments come with a certain degree of toxicity it is very essential to identify a cancer type correctly and then administer the relevant therapy. With the arrival of powerful tools such as gene expression microarrays the cancer classification basis is slowly changing from morphological properties to molecular signatures. Several recent studies have demonstrated a marked improvement in prediction accuracy of tumor types based on gene expression microarray measurements over clinical markers. The main challenge in working with gene expression microarrays is that there is a huge number of genes to work with. Out of them only a small fraction are actually relevant for differentiating between different types of cancer. A Bayesian nearest neighbor model equipped with an integrated variable selection technique is proposed to overcome this challenge. This classification and gene selection model is able to classify different cancer types accurately and simultaneously identify the relevant or important genes. The proposed model is completely automatic in the sense that it adaptively picks up the neighborhood size and the important covariates. The method is successfully applied to three simulated data sets and four well known real data sets. To demonstrate the competitiveness of the method a comparative study is also done with several other “off the shelf” popular classification methods. For all the simulated data sets and real life data sets, the proposed method produced highly competitive if not better results. While the standard approach is two step model building for gene selection and then tumor prediction, this novel adaptive gene selection technique automatically selects the relevant genes along with tumor class prediction in one go. The biological relevance of the selected genes are also discussed to validate the claim.  相似文献   

2.
A hybrid Huberized support vector machine (HHSVM) with an elastic-net penalty has been developed for cancer tumor classification based on thousands of gene expression measurements. In this paper, we develop a Bayesian formulation of the hybrid Huberized support vector machine for binary classification. For the coefficients of the linear classification boundary, we propose a new type of prior, which can select variables and group them together simultaneously. Our proposed prior is a scale mixture of normal distributions and independent gamma priors on a transformation of the variance of the normal distributions. We establish a direct connection between the Bayesian HHSVM model with our special prior and the standard HHSVM solution with the elastic-net penalty. We propose a hierarchical Bayes technique and an empirical Bayes technique to select the penalty parameter. In the hierarchical Bayes model, the penalty parameter is selected using a beta prior. For the empirical Bayes model, we estimate the penalty parameter by maximizing the marginal likelihood. The proposed model is applied to two simulated data sets and three real-life gene expression microarray data sets. Results suggest that our Bayesian models are highly successful in selecting groups of similarly behaved important genes and predicting the cancer class. Most of the genes selected by our models have shown strong association with well-studied genetic pathways, further validating our claims.  相似文献   

3.
Abstract: Cancer classification, through gene expression data analysis, has produced remarkable results, and has indicated that gene expression assays could significantly aid in the development of efficient cancer diagnosis and classification platforms. However, cancer classification, based on DNA array data, remains a difficult problem. The main challenge is the overwhelming number of genes relative to the number of training samples, which implies that there are a large number of irrelevant genes to be dealt with. Another challenge is from the presence of noise inherent in the data set. It makes accurate classification of data more difficult when the sample size is small. We apply genetic algorithms (GAs) with an initial solution provided by t statistics, called t‐GA, for selecting a group of relevant genes from cancer microarray data. The decision‐tree‐based cancer classifier is built on the basis of these selected genes. The performance of this approach is evaluated by comparing it to other gene selection methods using publicly available gene expression data sets. Experimental results indicate that t‐GA has the best performance among the different gene selection methods. The Z‐score figure also shows that some genes are consistently preferentially chosen by t‐GA in each data set.  相似文献   

4.
There exist several methods for binary classification of gene expression data sets. However, in the majority of published methods, little effort has been made to minimize classifier complexity. In view of the small number of samples available in most gene expression data sets, there is a strong motivation for minimizing the number of free parameters that must be fitted to the data. In this paper, a method is introduced for evolving (using an evolutionary algorithm) simple classifiers involving a minimal subset of the available genes. The classifiers obtained by this method perform well, reaching 97% correct classification of clinical outcome on training samples from the breast cancer data set published by van't Veer, and up to 89% correct classification on validation samples from the same data set, easily outperforming previously published results.  相似文献   

5.
Genomic data, and more generally biomedical data, are often characterized by high dimensionality. An input selection procedure can attain the two objectives of highlighting the relevant variables (genes) and possibly improving classification results. In this paper, we propose a wrapper approach to gene selection in classification of gene expression data using simulated annealing along with supervised classification. The proposed approach can perform global combinatorial searches through the space of all possible input subsets, can handle cases with numerical, categorical or mixed inputs, and is able to find (sub-)optimal subsets of inputs giving low classification errors. The method has been tested on publicly available bioinformatics data sets using support vector machines and on a mixed type data set using classification trees. We also propose some heuristics able to speed up the convergence. The experimental results highlight the ability of the method to select minimal sets of relevant features.  相似文献   

6.
基于基因表达谱对组织样本进行分类,在疾病诊断领域,是个非常重要的研究课题。在基因表达数据中,基因的数量(几千个)相对于数据样本(几十个)的个数通常比较多;也就是说,数据的维数相比于数据点的个数来说比较高(这个就是采样不足问题)。过高的维数(特征或基因数)将给分类问题带来极大的挑战。提出了结合非相关线性判别式分析方法(ULDA)和支持向量机(SVM)分类算法,对结肠癌组织样本进行分类识别,并同其他方法作了比较研究,分类效果得到了提高;结果表明了该方法的可行性和有效性。  相似文献   

7.
Cancer classification is one of the major applications of the microarray technology. When standard machine learning techniques are applied for cancer classification, they face the small sample size (SSS) problem of gene expression data. The SSS problem is inherited from large dimensionality of the feature space (due to large number of genes) compared to the small number of samples available. In order to overcome the SSS problem, the dimensionality of the feature space is reduced either through feature selection or through feature extraction. Linear discriminant analysis (LDA) is a well-known technique for feature extraction-based dimensionality reduction. However, this technique cannot be applied for cancer classification because of the singularity of the within-class scatter matrix due to the SSS problem. In this paper, we use Gradient LDA technique which avoids the singularity problem associated with the within-class scatter matrix and shown its usefulness for cancer classification. The technique is applied on three gene expression datasets; namely, acute leukemia, small round blue-cell tumour (SRBCT) and lung adenocarcinoma. This technique achieves lower misclassification error as compared to several other previous techniques.  相似文献   

8.
随着大规模基因表达谱技术的发展,基于基因表达谱的癌症诊断方法正在成为临床医学上一种快速有效的诊断方法,但是由于基因表达数据维数过高、样本量小、噪声大,使得正确提取有关癌症的特征基因成为关键。以结肠癌肿瘤的基因表达谱数据为例,提出了结合Fisher权函数、离散傅里叶变换和主成分分析的混合特征基因提取方法,以多元Logistic回归分析和贝叶斯决策作为分类器进行肿瘤分类检测。实验结果表明,该方法对于结肠癌数据集CV识别准确率高达96.80%。  相似文献   

9.
This paper combines a powerful algorithm, called Dongguang Li (DGL) global optimization, with the methods of cancer diagnosis through gene selection and microarray analysis. A generic approach to cancer classification based on gene expression monitoring by DNA microarrays is proposed and applied to two test cancer cases, colon and leukemia. The study attempts to analyze multiple sets of genes simultaneously, for an overall global solution to the gene??s joint discriminative ability in assigning tumors to known classes. With the workable concepts and methodologies described here an accurate classification of the type and seriousness of cancer can be made. Using the orthogonal arrays for sampling and a search space reduction process, a computer program has been written that can operate on a personal laptop computer. Both the colon cancer and the leukemia microarray data can be classified 100% correctly without previous knowledge of their classes. The classification processes are automated after the gene expression data being inputted. Instead of examining a single gene at a time, the DGL method can find the global optimum solutions and construct a multi-subsets pyramidal hierarchy class predictor containing up to 23 gene subsets based on a given microarray gene expression data collection within a period of several hours. An automatically derived class predictor makes the reliable cancer classification and accurate tumor diagnosis in clinical practice possible.  相似文献   

10.
随着DNA微阵列技术的出现,大量关于不同肿瘤的基因表达谱数据集被发布到网络上,从而使得对肿瘤特征基因选择和亚型分类的研究成为生物信息学领域的热点。基于Lasso(least absolute shrinkage and selection operator)方法提出了K-split Lasso特征选择方法,其基本思想是将数据集平均划分为K份,分别使用Lasso方法对每份进行特征选择,而后将选择出来的每份特征子集合并,重新进行特征选择,得到最终的特征基因。实验采用支持向量机作为分类器,结果表明K-split Lasso方法减少了冗余特征,提高了分类精度,具有良好的稳定性。由于每次计算的维数降低,K-split Lasso方法解决了计算开销过大的问题,并在一定程度上解决了"过拟合"问题。因此K-split Lasso方法是一种有效的肿瘤特征基因选择方法。  相似文献   

11.
A microarray machine offers the capacity to measure the expression levels of thousands of genes simultaneously. It is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for cancer classification. However, the urgent problems in the use of gene expression data are the availability of a huge number of genes relative to the small number of available samples, and the fact that many of the genes are not relevant to the classification. It has been shown that selecting a small subset of genes can lead to improved accuracy in the classification. Hence, this paper proposes a solution to the problems by using a multiobjective strategy in a genetic algorithm. This approach was tried on two benchmark gene expression data sets. It obtained encouraging results on those data sets as compared with an approach that used a single-objective strategy in a genetic algorithm. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008  相似文献   

12.
Hong-Qiang  Hau-San  De-Shuang  Jun 《Pattern recognition》2007,40(12):3379-3392
In this paper, we address the problem of extracting gene regulation information from microarray data for cancer classification. From the biological viewpoint, a model of gene regulation probability is established where three types of gene regulation states in a tissue sample are assumed and then two regulation events correlated with the class distinction are defined. Different from the previous approaches, the proposed algorithm uses gene regulation probabilities as carriers of regulation information to select genes and construct classifiers. The proposed approach is successfully applied to two public available microarray data sets, the leukemia data and the prostate data. Experimental results suggest that gene selection based on regulation information can greatly improve cancer classification, and the classifier based on regulation information is more efficient and more stable than several previous classification algorithms.  相似文献   

13.
由于基因表达数据高维度、高噪声、小样本的特点,基因选择一直是肿瘤分类的一大挑战。为了提高肿瘤分类的精度,同时保证基因选择的效率,提出一种结合Relief-F和CART决策树的自适应粒子群优化(APSO)算法(R-C-APSO)。该方法首先利用Relief-F快速过滤大量无关基因和噪声,缩小基因选择范围;然后以CART决策树为适应度函数,用APSO算法对基因进行最终搜索。通过6个数据集的分析实验,实验结果表明,R-C-APSO拥有较高的分类精度和较快的基因选择速度,且具有良好的稳定性。  相似文献   

14.
Due to the high dimensionality of microarray gene expression data and complicated correlations in gene expression levels, statistical methods for analyzing these data often are computationally intensive, requiring special expertise for their implementation. Biologists without such expertise will benefit from a computationally efficient and easy-to-implement analytic method. In this article, we develop such a method: a two-stage empirical Bayes method for identifying differentially expressed genes. We use a special technique to reduce the dimension of the parameter space in the preliminary stage, and construct a Bayesian model in the second stage. The computation of our method is efficient and requires little calibration for real microarray gene expression data. Specifically, we employ statistical tools, including the empirical Bayes estimation and a distribution approximation approach, to speed up computation and at the same time to preserve precision. We develop a score for evaluating the magnitude of the overall differential gene expression levels based on our Bayesian model, and declare differential expression according to the posterior probabilities that their scores exceed some threshold value. The number of declarations is determined by a false discovery rate procedure.  相似文献   

15.
从癌症基因表达谱分析入手,针对基因表达谱维数高、样本少的特点,提出一种用于癌症分类的基于邻域粗糙集和概率神经网络集成的分类方法.首先利用Relief算法对基因进行排序,然后利用邻域粗糙集选取分类特征基因,最后结合概率神经网络集成分类模型进行癌症分类.实验结果表明,该方法可以快速有效地选取癌症特征基因,能获得更好的分类效...  相似文献   

16.
Due to the high dimensionality of microarray gene expression data and complicated correlations in gene expression levels, statistical methods for analyzing these data often are computationally intensive, requiring special expertise for their implementation. Biologists without such expertise will benefit from a computationally efficient and easy-to-implement analytic method. In this article, we develop such a method: a two-stage empirical Bayes method for identifying differentially expressed genes. We use a special technique to reduce the dimension of the parameter space in the preliminary stage, and construct a Bayesian model in the second stage. The computation of our method is efficient and requires little calibration for real microarray gene expression data. Specifically, we employ statistical tools, including the empirical Bayes estimation and a distribution approximation approach, to speed up computation and at the same time to preserve precision. We develop a score for evaluating the magnitude of the overall differential gene expression levels based on our Bayesian model, and declare differential expression according to the posterior probabilities that their scores exceed some threshold value. The number of declarations is determined by a false discovery rate procedure.  相似文献   

17.
DNA microarray technology has emerged as a prospective tool for diagnosis of cancer and its classification. It provides better insights of many genetic mutations occurring within a cell associated with cancer. However, thousands of gene expressions measured for each biological sample using microarray pose a great challenge. Many statistical and machine learning methods have been applied to get most relevant genes prior to cancer classification. A two phase hybrid model for cancer classification is being proposed, integrating Correlation-based Feature Selection (CFS) with improved-Binary Particle Swarm Optimization (iBPSO). This model selects a low dimensional set of prognostic genes to classify biological samples of binary and multi class cancers using Naive–Bayes classifier with stratified 10-fold cross-validation. The proposed iBPSO also controls the problem of early convergence to the local optimum of traditional BPSO. The proposed model has been evaluated on 11 benchmark microarray datasets of different cancer types. Experimental results are compared with seven other well known methods, and our model exhibited better results in terms of classification accuracy and the number of selected genes in most cases. In particular, it achieved up to 100% classification accuracy for seven out of eleven datasets with a very small sized prognostic gene subset (up to <1.5%) for all eleven datasets.  相似文献   

18.
Searching for an effective dimension reduction space is an important problem in regression, especially for high-dimensional data such as microarray data. A major characteristic of microarray data consists in the small number of observations n and a very large number of genes p. This “large p, small n” paradigm makes the discriminant analysis for classification difficult. In order to offset this dimensionality problem a solution consists in reducing the dimension. Supervised classification is understood as a regression problem with a small number of observations and a large number of covariates. A new approach for dimension reduction is proposed. This is based on a semi-parametric approach which uses local likelihood estimates for single-index generalized linear models. The asymptotic properties of this procedure are considered and its asymptotic performances are illustrated by simulations. Applications of this method when applied to binary and multiclass classification of the three real data sets Colon, Leukemia and SRBCT are presented.  相似文献   

19.
The generalized Dirichlet distribution has been shown to be a more appropriate prior than the Dirichlet distribution for naïve Bayesian classifiers. When the dimension of a generalized Dirichlet random vector is large, the computational effort for calculating the expected value of a random variable can be high. In document classification, the number of distinct words that is the dimension of a prior for naïve Bayesian classifiers is generally more than ten thousand. Generalized Dirichlet priors can therefore be inapplicable for document classification from the viewpoint of computational efficiency. In this paper, some properties of the generalized Dirichlet distribution are established to accelerate the calculation of the expected values of random variables. Those properties are then used to construct noninformative generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models. Our experimental results on two document sets show that generalized Dirichlet priors can achieve a significantly higher prediction accuracy and that the computational efficiency of naïve Bayesian classifiers is preserved.  相似文献   

20.
Cancer classification through gene expression data analysis has recently emerged as an active area of research. This paper applies Genetic Algorithms (GA) for selecting a group of relevant genes from cancer microarray data. Then, the popular classifiers, such as OneR, Naïve Bayes, decision tree, and Support Vector Machine (SVM), are built on the basis of these selected genes. The performance of those classifiers is evaluated by using the publicly available gene expression data sets. Experimental results indicate that the cascade of GA and SVM has the highest rank among different methods. Moreover, the gene selection operation of GA is reproducible.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号