Similar literature
1.
The significance of the preprocessing stage in any data mining task is well known. Before attempting medical data classification, characteristics of medical datasets, including noise, incompleteness, and the existence of multiple and possibly irrelevant features, need to be addressed. In this paper, we show that selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset. The preprocessing operations considered include the discretization of numeric attributes, the selection of attribute subset(s), and the handling of missing values. The classification is performed by an ant colony optimization algorithm as a case study. Experimental results on 25 real-world medical datasets show that a significant relative improvement in predictive accuracy, exceeding 60% in some cases, is obtained.

2.
The ability to provide thousands of gene expression values simultaneously makes microarray data very useful for phenotype classification. A major constraint in phenotype classification is that the number of genes greatly exceeds the number of samples. We overcame this constraint in two ways: we increased the number of samples by integrating independently generated microarrays that had been designed with the same biological objectives, and we reduced the number of genes involved in the classification by selecting a small set of informative genes. We were thus able to make maximal use of the abundant microarray data being stockpiled by thousands of different research groups while improving classification accuracy. Our goal is to implement a feature (gene) selection method that is applicable to integrated microarrays and to build a highly accurate classifier that permits straightforward biological interpretation. In this paper, we propose a two-stage approach. First, we performed a direct integration of individual microarrays by transforming each expression value into a rank value within a sample, and identified informative genes by calculating the number of swaps needed to reach a perfectly split sequence. Second, we built a classifier, a parameter-free ensemble method that uses only the pre-selected informative genes. Using this classifier, derived from large, integrated microarray sample datasets, we achieved high accuracy, sensitivity, and specificity in the classification of an independent test dataset.
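
The two operations named in the first stage can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how swaps are counted or how ties are handled, so the sketch assumes adjacent swaps on a 0/1 class-label sequence and ranks starting at 1. The function names `rank_within_sample` and `swaps_to_split` are hypothetical.

```python
def rank_within_sample(expression):
    """Replace each gene's expression value by its rank (1 = lowest) within one sample."""
    order = sorted(range(len(expression)), key=lambda g: expression[g])
    ranks = [0] * len(expression)
    for rank, gene in enumerate(order, start=1):
        ranks[gene] = rank
    return ranks


def swaps_to_split(labels):
    """Minimum number of adjacent swaps that perfectly split a 0/1 label sequence.

    "Perfectly split" is taken to mean all 0s before all 1s, or all 1s before
    all 0s (an assumption; the abstract does not define it further).
    """
    def inversions(seq, first):
        # Each `first` label must cross every other-class label that precedes it.
        misplaced, swaps = 0, 0
        for label in seq:
            if label == first:
                swaps += misplaced
            else:
                misplaced += 1
        return swaps

    return min(inversions(labels, 0), inversions(labels, 1))


# Ranks make samples from different arrays comparable on a common scale.
print(rank_within_sample([1.0, 5.0, 3.0]))  # [1, 3, 2]
# Class labels of samples ordered by one gene's rank; fewer swaps = more informative.
print(swaps_to_split([0, 1, 0, 1]))         # 1
```

Under these assumptions, a gene whose ordering already separates the two classes scores 0, and the score grows as the classes interleave.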

3.
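Hyperspectral data are widely used for material classification and identification, but their large data volume and high inter-band correlation seriously degrade classification accuracy and limit applications. To address these problems, existing band selection methods are analyzed and a genetic algorithm based on band clustering and supervised classification is proposed for hyperspectral band selection: the K-means clustering algorithm is used to cluster the band data and construct band subsets; a fitness function is built from the classification accuracy of a family of classifiers, and a genetic algorithm is used to optimize the selection from the band subsets. Finally, comparative experiments on broadleaf forest hyperspectral data show that, for classification applications, the proposed algorithm selects hyperspectral bands very effectively.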

4.
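Internet traffic classification is the process of identifying network applications and classifying the corresponding traffic, and is regarded as the most basic function of modern network management and security systems. Application-related traffic classification is a foundational technology for network security. Traditional traffic classification methods include port-based prediction and payload-based deep inspection. In current network environments, the traditional methods face practical problems such as dynamic ports and encrypted applications, so machine learning (ML) techniques based on statistical flow features are used for traffic classification and identification. Machine learning can automatically search the supplied traffic data and describe useful structural patterns, which helps classify traffic intelligently. Naive Bayes was used first for network traffic identification and classification; in experiments on specific traffic it performed well, with accuracy above 90%, but for traffic such as peer-to-peer (P2P) its identification accuracy reached only about 50%. Support vector machines (SVM) and neural networks (NN) were used later, and the neural network approach brought overall traffic classification accuracy above 80%. Multiple studies show that applying a variety of machine learning methods, and subsequently refining them, has substantially improved the accuracy of traffic classification.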

5.
Abstract: Cancer classification through gene expression data analysis has produced remarkable results, and has indicated that gene expression assays could significantly aid in the development of efficient cancer diagnosis and classification platforms. However, cancer classification based on DNA array data remains a difficult problem. The main challenge is the overwhelming number of genes relative to the number of training samples, which implies that there are a large number of irrelevant genes to be dealt with. Another challenge is the noise inherent in the data set, which makes accurate classification more difficult when the sample size is small. We apply genetic algorithms (GAs) with an initial solution provided by t statistics, called t-GA, for selecting a group of relevant genes from cancer microarray data. A decision-tree-based cancer classifier is then built on these selected genes. The performance of this approach is evaluated by comparing it to other gene selection methods using publicly available gene expression data sets. Experimental results indicate that t-GA has the best performance among the different gene selection methods. The Z-score figure also shows that some genes are consistently preferentially chosen by t-GA in each data set.
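
How a t statistic might seed the genetic algorithm can be sketched as below. The abstract does not state which t statistic is used or how the initial solution is built, so this sketch assumes the Welch (unequal-variance) two-sample form and simply takes the k genes with the largest absolute value; `t_statistic` and `seed_genes` are hypothetical names.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(group_a, group_b):
    """Welch two-sample t statistic for one gene (assumes non-zero variance)."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def seed_genes(expr_a, expr_b, k):
    """Indices of the k genes with the largest |t|, usable as a GA starting subset.

    expr_a and expr_b each hold one list of sample values per gene,
    for the two classes being compared.
    """
    scores = [abs(t_statistic(a, b)) for a, b in zip(expr_a, expr_b)]
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


# Three genes, three samples per class; gene 1 differs most between classes.
tumour = [[1.0, 2.0, 3.0], [5.0, 5.5, 6.0], [2.0, 4.0, 6.0]]
normal = [[1.5, 2.5, 3.5], [1.0, 1.5, 2.0], [3.0, 4.0, 5.0]]
print(seed_genes(tumour, normal, 2))  # [1, 0]
```

Starting the search from high-|t| genes rather than a random subset is one plausible reading of "an initial solution provided by t statistics".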

6.
Abstract: In this work an entropic filtering algorithm (EFA) for feature selection is described, as a workable method to generate a relevant subset of genes. This is a fast feature selection method based on finding feature subsets that jointly maximize the normalized multivariate conditional entropy with respect to the classification ability of tumours. The EFA is tested in combination with several machine learning algorithms on five public domain microarray data sets. It is found that this combination offers subsets yielding similar or much better accuracies than using the full set of genes. The solutions obtained are of comparable quality to previous results, but they are obtained in a maximum of half an hour computing time and use a very low number of genes.

7.
Identifying the Cost-To-Serve (CTS) of customers is one of the most challenging problems in Supply Chain Management because of the diversity in their business activities. For the particular case of the industrial gas business, we are interested in predicting the cost to deliver bulk (liquefied) gas to new customers using a multifactor linear regression model. Developing a single model, i.e. analyzing the observations all at once, produces poor prediction results. Therefore, prior to the regression analysis, a new supervised learning technique is used to group customers who are similar in some sense. Classes of customers are represented by hyper-boxes and a linear regression model is subsequently built within each class. The combination of data classification and regression is shown to increase the accuracy of the prediction. Two Mixed-Integer Linear Programming (MILP) models are developed for data classification purposes. Although we are dealing with a supervised learning method, classes are not predefined in our case. Rather, we input a continuous "classification" attribute that is optimally discretized by the MILPs in order to minimize the number of misclassifications. Therefore our data classification model offers a broader range of applications. A number of illustrative examples are used to demonstrate the effectiveness of the proposed approach.

8.
王乐, 韩萌, 李小娟, 张妮, 程浩东. 《计算机应用》, 2022, 42(4): 1137-1147
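To address the problem in data stream ensemble classification of how to adapt classifiers to a constantly changing data stream and how to adjust the weights of base classifiers to select a suitable set of classifiers, an ensemble classification algorithm based on a dynamic weighting function is proposed. First, a weighting function is proposed to adjust the weights of the base classifiers, and the classifiers are trained on continuously updated data blocks. Then, a new weight function is used to make a reasonable selection among candidate classifiers. Finally, the incremental nature of decision trees is applied in the base classifiers to classify the data stream. Extensive experiments show that the performance of the ensemble classification algorithm based on the dynamic weighting function is unaffected by block size. Compared with the AUE2 algorithm, the number of leaves decreases by 681.3 on average, the number of nodes by 1,192.8 on average, and the tree depth by 4.42 on average, while accuracy is relatively improved and time consumption reduced. The experimental results show that, when classifying data streams, the algorithm not only maintains accuracy but also saves a large amount of memory and time.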

9.
Colon cancer gene selection and sample classification based on genetic algorithms   Total citations: 2 (self-citations: 1, citations by others: 1)
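A new method based on two rounds of genetic algorithms is proposed for gene selection and sample classification on colon cancer microarray data. The method first uses the Bhattacharyya distance of each gene to filter out most genes irrelevant to classification, then uses GA/CFS, which combines a genetic algorithm with CFS (Correlation-based Feature Selection), to select good gene subsets and archives them. Candidate subsets for further search are chosen according to how frequently genes are selected in the archived subsets, and finally GA/SVM, which combines a genetic algorithm with an SVM, selects the classification feature subset from the candidate gene subsets. This GA/CFS-GA/SVM method was applied to colon cancer microarray data; the experimental results and a comparison with the literature show that the method performs well.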
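
The initial filtering step can be sketched as below. The abstract names the Bhattacharyya distance but not its exact form, so this sketch assumes each gene's expression is modelled as a univariate Gaussian per class, for which the distance has a closed form. The names `bhattacharyya` and `filter_genes`, and the choice to keep a fixed number of genes, are assumptions for illustration.

```python
from math import log, sqrt
from statistics import mean, variance


def bhattacharyya(class_a, class_b):
    """Bhattacharyya distance between two classes for one gene.

    Assumes univariate Gaussian class distributions with non-zero variance:
    B = (mu_a - mu_b)^2 / (4 (var_a + var_b))
        + 0.5 * ln((var_a + var_b) / (2 * sd_a * sd_b))
    """
    mu_a, mu_b = mean(class_a), mean(class_b)
    var_a, var_b = variance(class_a), variance(class_b)
    mean_term = 0.25 * (mu_a - mu_b) ** 2 / (var_a + var_b)
    var_term = 0.5 * log((var_a + var_b) / (2 * sqrt(var_a * var_b)))
    return mean_term + var_term


def filter_genes(expr_a, expr_b, keep):
    """Indices of the `keep` genes with the largest distance between classes."""
    scores = [bhattacharyya(a, b) for a, b in zip(expr_a, expr_b)]
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return ranked[:keep]


# Equal variances, means 2 and 5: only the mean term contributes.
print(bhattacharyya([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 1.125
```

Genes surviving this filter would then be passed to the GA/CFS and GA/SVM stages described above, which are not sketched here.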

10.
刘殊. 《计算机应用》, 2009, 29(6): 1582-1589
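To address the lack of an efficient classifier generation mechanism and of an overfitting suppression mechanism in negative selection algorithms, a negative selection algorithm for multi-class pattern classification, CS-NSA, is proposed. By introducing a clonal selection mechanism, classifiers learn adaptively according to their classification performance and stimulation level; to address overfitting in multi-class pattern classification, a pruning mechanism for the detector set is introduced, which strengthens the detectors' generalization ability. Comparative experiments show that CS-NSA achieves a higher correct recognition rate than the well-known artificial immune classifier AIRS.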

11.
A network traffic classification method using DBSCAN clustering   Total citations: 1 (self-citations: 0, citations by others: 1)
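A network traffic classification method based on the DBSCAN algorithm is proposed, and the definition of a flow, feature generation, feature selection, classification rules, and the evaluation of classification performance are described. A PCA-based method for selecting the optimal feature subset of network traffic is also proposed. Experimental results show that the proposed method achieves high overall accuracy and precision and can be applied effectively to network traffic classification.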

12.
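The effectiveness of automatic text classification depends heavily on the selection of attribute features. Traditional feature selection based on frequency-threshold filtering loses useful information and hurts classification accuracy; to address this shortcoming, an automatic text classification algorithm based on rough sets is proposed. The method discretizes the weighted feature attributes and builds a decision table; it screens the condition attributes in the decision table according to dependency-based attribute significance; and it uses a heuristic algorithm based on conditional information entropy to reduce the text attribute features. Experimental results show that the method removes a large number of redundant feature attributes and improves the running efficiency of text classification without lowering classification accuracy.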
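
The dependency-based attribute significance used in the screening step can be sketched with the standard rough-set definitions: the dependency degree is the fraction of objects in the positive region, and an attribute's significance is the drop in that degree when the attribute is removed. The decision table layout (a list of dicts) and the attribute names in the usage example are hypothetical, and the entropy-based reduction step is not shown.

```python
from collections import defaultdict


def dependency_degree(rows, condition_attrs, decision_attr):
    """gamma_C(D) = |POS_C(D)| / |U| for a decision table given as a list of dicts."""
    # Group objects into equivalence classes by their condition attribute values.
    blocks = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        blocks[key].append(row[decision_attr])
    # A block lies in the positive region when all its objects share one decision.
    positive = sum(len(d) for d in blocks.values() if len(set(d)) == 1)
    return positive / len(rows)


def significance(rows, condition_attrs, decision_attr, attr):
    """Drop in dependency degree when `attr` is removed from the condition set."""
    rest = [a for a in condition_attrs if a != attr]
    return (dependency_degree(rows, condition_attrs, decision_attr)
            - dependency_degree(rows, rest, decision_attr))


# Discretized term weights for four documents (hypothetical data).
table = [
    {"t1": 1, "t2": 0, "cls": "sport"},
    {"t1": 1, "t2": 1, "cls": "sport"},
    {"t1": 0, "t2": 1, "cls": "tech"},
    {"t1": 0, "t2": 0, "cls": "tech"},
]
print(significance(table, ["t1", "t2"], "cls", "t1"))  # 1.0
print(significance(table, ["t1", "t2"], "cls", "t2"))  # 0.0
```

In this toy table `t2` has zero significance, so it is the kind of redundant attribute the screening step would discard.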

13.
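Following the immune negative selection principle, a negative selection classifier based on masked segment matching is designed to perform rule-matching classification. A classification rule encoding suited to immune optimization and an evaluation criterion based on a classification information score are given, and the rules are optimized as a population through immune evolution to generate a more concise and understandable rule set. The method lets the desirable properties of immune optimization be fully exploited in data classification, avoids the lack of global optimization ability in traditional classification algorithms, and improves the ability to recognize samples. Experimental results show that this immune classifier and its optimization method are an effective and feasible classifier design that improves the accuracy of data classification.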

14.
In this paper, a new method is introduced that combines structural and syntactic approaches to fingerprint classification. The goal of the proposed ridge distribution (R-D) model is to show that a fingerprint can be classified into the complete seven classes of the Henry classification. From our observation, there exist only 10 basic ridge patterns from which fingerprints are constructed. Fingerprint classes can be interpreted as combinations of these 10 ridge patterns with different ridge distribution sequences. In this paper, the classification task is performed on the global distribution of the 10 basic ridge patterns by analyzing the ridge shapes and the sequence of the ridge distribution. A regular expression is formulated for each class and an NFA model is constructed accordingly. An explicit rejection criterion is also defined. For the seven-class fingerprint classification problem, our method achieves a classification accuracy of 93.4% with a 5.1% rejection rate. For the five-class problem, an accuracy of 94.8% is achieved. Experimental results demonstrate the feasibility and validity of the proposed approach in fingerprint classification.

15.
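Because the characteristics of image classification feature points are loosely defined, similarity measurement suffers from large errors. To address this problem, an image classification method using a class-separability criterion for feature points is proposed. Information entropy theory is used to extract the separability of image feature points; separability distances between feature vectors are computed according to the different properties of the decision attributes labeled by the image feature vectors, yielding a nearest-neighbor feature vector set; a new image class decision criterion is then defined from two quantities: the average distance between each feature vector of the image to be classified and the classes labeled by the nearest-neighbor feature vector set, and the average separability measure. Theoretical analysis and simulation experiments on the Caltech256 image library show that the class-separability criterion for feature points effectively improves image classification accuracy.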

16.
The problem of classifying traffic flows in networks has become more and more important in recent times, and much research has been dedicated to it. In recent years, there has been a lot of interest in classifying traffic flows by application, based on the statistical features of each flow. Information about the applications that are being used on a network is very useful in network design, accounting, management, and security. In our previous work we proposed a classification algorithm for Internet traffic flow classification based on Artificial Immune Systems (AIS). We also applied the algorithm to an available data set, and found that the algorithm performed as well as other algorithms and was insensitive to input parameters, which makes it valuable for embedded systems. It is also very simple to implement, and generalizes well from small training data sets. In this research, we expanded on the previous research by introducing several optimizations in the training and classification phases of the algorithm. We improved the design of the original algorithm in order to make it more predictable. We also give the asymptotic complexity of the optimized algorithm and derive a bound on its generalization error. Lastly, we experimented with several different distance formulas to improve the classification performance. In this paper we show that the changes and optimizations applied to the original algorithm do not change it functionally, while making its execution 50–60% faster. We also show that the Manhattan distance outperforms the Euclidean distance for this application, giving 1–2% higher accuracy and making the accuracy of the algorithm comparable to that of a Naïve Bayes classifier in previous research that uses the same data set.
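
Why the choice of distance formula can change a classification is easy to see in a small example. The sketch below is not the AIS algorithm from the abstract: it is a plain nearest-prototype classifier, used only to show the Euclidean and Manhattan distances disagreeing on the same flow. The two-feature flow vectors and the labels `"web"` and `"p2p"` are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line (L2) distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block (L1) distance: sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(flow, prototypes, distance):
    """Label of the (vector, label) prototype closest to `flow`."""
    return min(prototypes, key=lambda p: distance(flow, p[0]))[1]


# Hypothetical prototypes in a two-feature space.
prototypes = [((3.0, 3.0), "web"), ((5.0, 0.0), "p2p")]
flow = (0.0, 0.0)
print(classify(flow, prototypes, euclidean))  # web  (4.24 vs 5.0)
print(classify(flow, prototypes, manhattan))  # p2p  (6.0 vs 5.0)
```

The Manhattan distance also avoids the squaring and square root, which may matter on the embedded systems the abstract mentions, although the abstract itself attributes only the accuracy gain to it.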

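17.
The naive Bayes classifier cannot make effective use of the dependency information between attributes, and current dependency extensions emphasize efficiency, so the classification accuracy of the extended classifiers still needs improvement. To address these problems, a network dependency extension of the naive Bayes classifier is carried out by estimating attribute densities with a Gaussian kernel function that has a smoothing parameter, combined with a classification accuracy criterion and greedy selection of attribute parent nodes. Experiments on continuous-attribute classification data from UCI show that the classifier has good classification accuracy after the network dependency extension.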
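
The density estimate at the core of this approach can be sketched with the standard one-dimensional Gaussian kernel estimator, where the bandwidth `h` plays the role of the smoothing parameter. This covers only the density estimation for a single attribute; the dependency extension and the greedy parent selection are not shown, and the name `gaussian_kde` is a hypothetical choice for this sketch.

```python
from math import exp, pi, sqrt


def gaussian_kde(samples, h):
    """Return p(x) = 1 / (n * h) * sum_i K((x - x_i) / h) with a Gaussian kernel K.

    `samples` are the observed values of one continuous attribute within one
    class; `h` > 0 is the smoothing parameter (bandwidth).
    """
    n = len(samples)

    def density(x):
        total = sum(exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
        return total / (n * h * sqrt(2 * pi))

    return density


# Class-conditional density of one attribute, estimated from three values.
p = gaussian_kde([4.9, 5.1, 5.4], h=0.3)
print(p(5.0) > p(6.5))  # True: 5.0 lies among the training values
```

A larger `h` gives a smoother, flatter estimate, while a smaller `h` follows the training values more closely, which is why the smoothing parameter has to be chosen with the accuracy criterion in mind.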

18.
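Because gene expression data are high-dimensional, noisy, and small in sample size, gene selection has long been a major challenge in tumor classification. To improve tumor classification accuracy while keeping gene selection efficient, an adaptive particle swarm optimization (APSO) algorithm combining Relief-F and the CART decision tree (R-C-APSO) is proposed. The method first uses Relief-F to quickly filter out a large number of irrelevant genes and noise, narrowing the range of gene selection; it then uses the CART decision tree as the fitness function and performs the final search over genes with APSO. Experiments on six data sets show that R-C-APSO achieves high classification accuracy and fast gene selection, with good stability.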

19.
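To address the shortcomings of traditional search methods in block-matching motion estimation, a new block-matching motion estimation algorithm based on hybrid particle swarm optimization is proposed. While retaining the system's random search capability, the initial search population is designed according to the characteristics of motion vectors, and the particle swarm algorithm iterates toward the optimum in cooperation with chaotic differential evolution search; chaotic sequences are used to optimize the differential mutation operator and improve the algorithm's fine search ability. Identical-point detection and an appropriate termination scheme effectively reduce computational complexity. Experimental tests and validation show that the algorithm strikes a dynamic balance between search quality and computational complexity; its overall performance exceeds that of traditional fast motion estimation algorithms, and its results are closer to those of exhaustive search.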

20.
Bayesian networks are graphical models that describe dependency relationships between variables, and are powerful tools for studying probabilistic classifiers. At present, causal Bayesian network learning methods are used to construct Bayesian network classifiers, while the contribution of each attribute to the class is overlooked. In this paper, a Bayesian network designed specifically for classification, the restricted Bayesian classification network, is proposed. Combining dependency analysis between variables, classification accuracy evaluation criteria, and a search algorithm, a learning method for restricted Bayesian classification networks is presented. Experiments and analysis are carried out using data sets from the UCI machine learning repository. The results show that the restricted Bayesian classification network is more accurate than other well-known classifiers.
