Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Ant colony optimization (ACO) algorithms have been successfully applied to data classification, where they aim to discover a list of classification rules. However, because the search in ACO algorithms is essentially random, the lists of classification rules constructed by ACO-based classification algorithms are not fixed and may differ markedly even when the same training set is used. Those differences are generally ignored, so some beneficial information cannot be mined from the different data sets, which may lower the predictive accuracy. To overcome this shortcoming, this paper proposes a novel ACO-based classification rule discovery algorithm, named AntMinermbc, in which a new model of multiple rule sets is presented to produce multiple lists of rules. Multiple base classifiers are built in AntMinermbc, and each base classifier is expected to remedy the weaknesses of the others, which can improve the predictive accuracy by exploiting the useful information from the various base classifiers. A new heuristic function for ACO is also designed, which considers both correlation and coverage in order to avoid deceptively high accuracy. The performance of the algorithm is studied experimentally on 19 publicly available data sets and compared with several state-of-the-art classification approaches. The experimental results show that the predictive accuracy obtained by the algorithm is statistically higher than that of the compared approaches.

2.
The prediction of bank performance is an important issue. Poor bank performance may first result in bankruptcy, which is expected eventually to affect the economy of the country. Since the early 1970s, many researchers have made predictions on such issues. However, until recent years, most of them used traditional statistics to build the prediction model. With the vigorous development of data mining techniques, many researchers have begun to apply those techniques to various fields, including performance prediction systems. However, data mining techniques have the problem of parameter settings. Therefore, this study applies particle swarm optimization (PSO) to obtain suitable parameter settings for the support vector machine (SVM) and the decision tree (DT), and to select a subset of beneficial features without reducing the classification accuracy rate. To evaluate the proposed approaches, a dataset collected from Taiwanese commercial banks is used as source data. The experimental results show that the proposed approaches can obtain better parameter settings, remove unnecessary features, and improve the classification accuracy significantly.

3.
Contemporary biological technologies produce extremely high-dimensional data sets from which to design classifiers, with 20,000 or more potential features being commonplace. In addition, sample sizes tend to be small. In such settings, feature selection is an inevitable part of classifier design. Heretofore, there have been a number of comparative studies of feature selection, but they have either considered settings with much smaller dimensionality than those occurring in current bioinformatics applications or constrained their study to a few real data sets. This study compares some basic feature-selection methods in settings involving thousands of features, using both model-based synthetic data and real data. It defines distribution models involving different numbers of markers (useful features) versus non-markers (useless features) and different kinds of relations among the features. Under this framework, it evaluates the performance of feature-selection algorithms for different distribution models and classifiers. Both the classification error and the number of discovered markers are computed. Although the results clearly show that none of the considered feature-selection methods performs best across all scenarios, there are some general trends relative to sample size and relations among the features. For instance, the classifier-independent univariate filter methods have similar trends. Filter methods such as the t-test perform better than or similarly to wrapper methods on harder problems. This improved performance is usually accompanied by significant peaking. Wrapper methods perform better when the sample size is sufficiently large. ReliefF, the classifier-independent multivariate filter method, performs worse than univariate filter methods in most cases; however, ReliefF-based wrapper methods show performance similar to their t-test-based counterparts.

4.
Biological data often contain redundant and irrelevant features. These features can mislead the modeling algorithms and cause overfitting. Without a feature selection method, it is difficult for existing models to accurately capture the patterns in the data. The aim of feature selection is to choose a small number of relevant or significant features to enhance classification performance. Existing feature selection methods suffer from problems such as becoming stuck in local optima and being computationally expensive. To solve these problems, an efficient global search technique is needed. The Black Hole Algorithm (BHA) is an efficient and new global search technique, inspired by the behavior of black holes, which is being applied to several optimization problems. However, the potential of BHA for feature selection has not yet been investigated. This paper proposes a binary version of the Black Hole Algorithm, called BBHA, for solving the feature selection problem in biological data. The BBHA extends the existing BHA through appropriate binarization. Moreover, the performance of six well-known decision tree classifiers (Random Forest (RF), Bagging, C5.0, C4.5, Boosted C5.0, and CART) is compared in this study in order to employ the best one as the evaluator of the proposed algorithm. The performance of the proposed algorithm is tested on eight publicly available biological datasets and is compared with Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Simulated Annealing (SA), and Correlation based Feature Selection (CFS) in terms of accuracy, sensitivity, specificity, Matthews’ Correlation Coefficient (MCC), and Area Under the receiver operating characteristic (ROC) Curve (AUC).
To verify the applicability and generality of the BBHA, it was integrated with the Naive Bayes (NB) classifier and applied to further datasets in the text and image domains. The experimental results confirm that RF performs better than the other decision tree algorithms and that the proposed BBHA wrapper-based feature selection method is superior to BPSO, GA, SA, and CFS on all criteria. BBHA performs significantly better than BPSO and GA in terms of CPU time, the number of parameters for configuring the model, and the number of chosen optimized features. BBHA also has competitive or better performance than the other methods in the literature.

5.
Web Services are increasingly used for implementing large-scale e-business applications, but at present there is a lack of comprehensive methodologies based on sound engineering principles that can guide designers of service-oriented applications. This lack of methodological support is likely to lead to poorly designed and difficult-to-maintain e-business applications. In this paper we describe a design method for service-oriented applications that applies data engineering principles and the theoretical framework of data normalization to service design to produce a set of orthogonal services with normalized interfaces. We consider the impact of increasing service granularity on the cohesion and coupling of service operations, and discuss the associated design trade-offs. We use a travel example based on the Open Travel Alliance specification to illustrate how a document-oriented standard can be transformed into a set of well-designed service interfaces.

6.
While extensive research in data mining has been devoted to developing better feature selection techniques, none of this research has examined the intrinsic relationship between dataset characteristics and a feature selection technique’s performance. Our research therefore examines experimentally how dataset characteristics affect both the accuracy and the time complexity of feature selection. To evaluate the performance of various feature selection techniques on datasets with different characteristics, extensive experiments with five feature selection techniques, three types of classification algorithms, seven types of dataset characterization methods and all possible combinations of dataset characteristics are conducted on 128 publicly available datasets. We apply the decision tree method to evaluate the interdependencies between dataset characteristics and performance. The results of the study reveal the intrinsic relationship between dataset characteristics and feature selection techniques’ performance. Additionally, our study contributes to research in data mining by providing a roadmap for future research on feature selection and a significantly wider framework for comparative analysis.

7.
Data gravitation based classification (DGC) is a novel data classification technique based on the concept of data gravitation. The basic principle of the DGC algorithm is to classify data samples by comparing the data gravitation exerted by the different data classes. In the DGC model, a kind of “force” called data gravitation is computed between two data samples. Data from the same class are combined as a result of gravitation, and the data gravitation from different data classes can then be compared. A larger gravitation from a class means that the data sample belongs to that class. One outstanding advantage of DGC over other classification algorithms is its simple classification principle combined with high performance, which makes the DGC algorithm much easier to implement. Feature selection plays an important role in classification problems, and a novel feature selection algorithm based on the idea of DGC and weighted features is investigated. The proposed method is validated on 12 well-known classification data sets from the UCI machine learning repository. Experimental results illustrate that the proposed method is very efficient for data classification and feature selection.
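
The classification principle described in this abstract can be illustrated with a minimal Python sketch. It rests on assumptions the abstract does not state: every training sample is treated as a unit-mass particle, gravitation follows an inverse-square law in Euclidean distance, and the names `gravitation` and `classify` are hypothetical. It omits the merging of same-class data and the feature weighting that the paper mentions.

```python
from collections import defaultdict

def gravitation(sample, point, mass=1.0, eps=1e-12):
    """Assumed inverse-square 'data gravitation' between two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(sample, point))
    return mass / (squared_distance + eps)  # eps avoids division by zero

def classify(sample, training):
    """Assign `sample` to the class that exerts the largest total gravitation.

    `training` is a list of (feature_tuple, label) pairs.
    """
    totals = defaultdict(float)
    for features, label in training:
        totals[label] += gravitation(sample, features)
    return max(totals, key=totals.get)

# Toy data: two well-separated classes in two dimensions.
training = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"),
]
print(classify((0.2, 0.1), training))
print(classify((4.8, 5.1), training))
```

Summing the gravitation per class, rather than taking only the nearest neighbor, lets dense classes pull harder than sparse ones; that choice is part of this sketch, not something the abstract confirms.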

8.
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers benefit from all the items above but, more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors and, as a result, identifies with reasonable success which features are relevant to the learning task as the stream progresses. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and eventually improves the use of computational resources.

9.
Gene expression microarrays are a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. We present a new heuristic for selecting relevant gene subsets for later use in the classification task. Our method is based on the statistical significance of adding a gene from a ranked list to the final subset. The efficiency and effectiveness of our technique are demonstrated through extensive comparisons with other representative heuristics. Our approach shows excellent performance, not only at identifying relevant genes but also with respect to computational cost.

10.
Vanessa, Michel, Jérôme. Neurocomputing, 2009, 72(16-18): 3580
The classification of functional or high-dimensional data requires selecting a reduced subset of features from the initial set, both to help fight the curse of dimensionality and to help interpret the problem and the model. The mutual information criterion may be used in that context, but it is difficult to estimate from a finite set of samples. Efficient estimators are not designed specifically for a classification context, and thus suffer from further drawbacks and difficulties. This paper presents an estimator of mutual information that is specifically designed for classification tasks, including multi-class ones. It is combined with a recently published stopping criterion in a traditional forward feature selection procedure. Experiments on both traditional benchmarks and an industrial functional classification problem show the added value of this estimator.
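
A traditional forward feature selection procedure driven by mutual information can be sketched as follows. This is not the estimator proposed in the paper: it assumes discrete features and uses the naive plug-in (frequency-count) estimate, and it replaces the published stopping criterion with a fixed number of features `k`. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def forward_select(columns, labels, k):
    """Greedily add the feature that maximizes the joint MI with the labels.

    `columns` is a list of feature columns (one list of values per feature).
    Stopping after `k` features is an assumption of this sketch.
    """
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:
        def score(j):
            # Treat the already selected features plus candidate j as one joint variable.
            joint = list(zip(*(columns[i] for i in selected + [j])))
            return mutual_information(joint, labels)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

labels = [0, 0, 1, 1]
columns = [
    [0, 1, 0, 1],  # independent of the labels
    [0, 0, 1, 1],  # identical to the labels
    [1, 1, 1, 1],  # constant
]
print(forward_select(columns, labels, 1))
```

The plug-in estimate becomes unreliable as the joint variable grows and each joint value is observed only a few times, which is the finite-sample estimation difficulty the abstract refers to.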

11.
Identifying relevant genes from microarray data is needed in many applications. For such identification, different ranking techniques with different evaluation criteria are used, and these usually assign different ranks to the same gene. As a result, different techniques identify different gene subsets, which may not be the set of significant genes. To overcome such problems, this study suggests pipelining the ranking techniques. In each stage of the pipeline, a few of the lower-ranked features are eliminated, and at the end a relatively good subset of features is preserved. However, the order in which the ranking techniques are used in the pipeline is important to ensure that the significant genes are preserved in the final subset. For this experimental study, twenty-four unique pipeline models are generated from four gene ranking strategies. These pipelines are tested on seven different microarray databases to find a suitable pipeline for the task. The gene subset obtained is then tested with four classifiers, and four performance metrics are evaluated. No single pipeline dominates the others in performance; therefore, a grading system is applied to the results of these pipelines to find a consistent model. The grading system's finding that a pipeline model is significant is also confirmed by the Nemenyi post-hoc hypothesis test. The performance of this pipeline model is compared with that of the four ranking techniques; although it is not always superior, it yields better results most of the time and can be suggested as a consistent model. However, it requires more computational time than single ranking techniques.
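
The pipelining idea can be sketched in a few lines of Python. Everything concrete here is hypothetical: the ranker names `r1` to `r4`, the gene names, the scores, and the choice to drop one gene per stage are made up for illustration, and real rankers would compute their scores from expression data. The sketch only shows why four ranking strategies yield twenty-four pipelines and how each stage prunes the lowest-ranked features.

```python
from itertools import permutations

# Hypothetical per-ranker scores for six genes (higher means more relevant).
scores = {
    "r1": {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.7, "g5": 0.3, "g6": 0.8},
    "r2": {"g1": 0.4, "g2": 0.6, "g3": 0.9, "g4": 0.2, "g5": 0.8, "g6": 0.5},
    "r3": {"g1": 0.7, "g2": 0.3, "g3": 0.2, "g4": 0.9, "g5": 0.6, "g6": 0.4},
    "r4": {"g1": 0.5, "g2": 0.8, "g3": 0.6, "g4": 0.4, "g5": 0.1, "g6": 0.9},
}

def run_pipeline(order, features, drop_per_stage=1):
    """Apply the rankers in `order`, dropping the lowest-ranked features at each stage."""
    kept = list(features)
    for name in order:
        ranked = sorted(kept, key=scores[name].get, reverse=True)  # best first
        kept = ranked[:max(1, len(ranked) - drop_per_stage)]
    return kept

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pipelines = list(permutations(scores))  # 4! orderings of the four rankers
for order in pipelines[:3]:
    print(order, run_pipeline(order, genes))
```

Because a gene eliminated early can never be recovered by a later ranker, the final subset depends on the ordering, which is why the study evaluates all orderings instead of one.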

12.
Feature selection is often required as a preliminary step in many pattern recognition problems. However, most existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research, a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure that updates the feature subset according to improvements in the classification accuracy. The effectiveness of our proposal is tested on microarray data, which pose a difficult challenge for researchers because of the large number of gene expression values they contain and the small sample size. The results on eight microarray datasets show that the execution time is considerably shortened, whereas the performance is maintained or even improved compared with the standard algorithms applied to the non-partitioned datasets.

13.
Empirical optimizers like ATLAS have been very effective in optimizing computational kernels in libraries. The best choice of parameters such as tile size and degree of loop unrolling is determined in ATLAS by executing different versions of the computation. In contrast, optimizing compilers use a model-driven approach to program transformation. While the model-driven approach of optimizing compilers is generally orders of magnitude faster than ATLAS-like library generators, its effectiveness can be limited by the accuracy of the performance models used. In this paper, we describe an approach where a class of computations is modeled in terms of constituent operations that are empirically measured, thereby allowing modeling of the overall execution time. The performance model with empirically determined cost components is used to select library calls and choose data layout transformations in the context of the Tensor Contraction Engine, a compiler for a high-level domain-specific language for expressing computational models in quantum chemistry. The effectiveness of the approach is demonstrated through experimental measurements on representative computations from quantum chemistry.

14.
A semi-physical fusion approach that uses the MODIS BRDF/Albedo land surface characterization product and Landsat ETM+ data to predict ETM+ reflectance on the same, an antecedent, or a subsequent date is presented. The method may be used for ETM+ cloud/cloud shadow and SLC-off gap filling and for relative radiometric normalization. It is demonstrated over three study sites, one in Africa and two in the U.S. (Oregon and Idaho), that were selected to encompass a range of land cover and land use types and temporal variations in solar illumination, land cover, land use, and phenology. Specifically, the 30 m ETM+ spectral reflectance is predicted for a desired date as the product of the observed ETM+ reflectance and the ratio of the 500 m surface reflectance modeled using the MODIS BRDF spectral model parameters and the sun-sensor geometry on the predicted and observed Landsat dates. The difference between the predicted and observed ETM+ reflectance (prediction residual) is compared with the difference between the ETM+ reflectance observed on the two dates (temporal residual) and with respect to the MODIS BRDF model parameter quality. For all three scenes, and all but the shortest wavelength band, the mean prediction residual is smaller than the mean temporal residual, by up to a factor of three. The accuracy is typically higher at ETM+ pixel locations where the MODIS BRDF model parameters are derived using the best quality inversions. The method is most accurate for the ETM+ near-infrared (NIR) band; mean NIR prediction residuals are 9%, 12% and 14% of the mean NIR scene reflectance of the African, Oregon and Idaho sites respectively.
The developed fusion approach may be applied to any high spatial resolution satellite data; does not require any tuning parameters and so may be automated; is applied on a per-pixel basis and is unaffected by the presence of missing or contaminated neighboring Landsat pixels; accommodates temporal variations due to surface changes (e.g., phenological, land cover/land use variations) observable at the 500 m MODIS BRDF/Albedo product resolution; and allows for future improvements through BRDF model refinement and error assessment.
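
The prediction rule stated in words in this abstract can be written as a single equation. The notation is an assumption introduced here for illustration, not the paper's own: $\rho_{\mathrm{ETM+}}$ is the 30 m ETM+ reflectance in band $\lambda$, $\rho_{\mathrm{MODIS}}$ is the 500 m reflectance modeled from the MODIS BRDF parameters, $t_1$ is the observed Landsat date, $t_2$ is the date to be predicted, and $\Omega_1$, $\Omega_2$ are the sun-sensor geometries on those dates.

```latex
\hat{\rho}_{\mathrm{ETM+}}(\lambda, t_2)
  = \rho_{\mathrm{ETM+}}(\lambda, t_1)
    \times
    \frac{\rho_{\mathrm{MODIS}}(\lambda, \Omega_2, t_2)}
         {\rho_{\mathrm{MODIS}}(\lambda, \Omega_1, t_1)}
```

In the same notation, the prediction residual is the difference between $\hat{\rho}_{\mathrm{ETM+}}(\lambda, t_2)$ and the ETM+ reflectance actually observed on $t_2$, and the temporal residual is the difference between the ETM+ reflectances observed on $t_1$ and $t_2$.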

15.
An intelligent identification system for mixed anuran vocalizations is developed in this work so that the public can easily consult it online. The raw mixed anuran vocalization samples are first filtered by noise removal, high-frequency compensation, and discrete wavelet transform techniques, in that order. An adaptive end-point detection segmentation algorithm is proposed to effectively separate the individual syllables from the noise. Six features (spectral centroid, signal bandwidth, spectral roll-off, threshold-crossing rate, spectral flatness, and average energy) are extracted and serve as the input parameters of the classifier. Meanwhile, a decision tree is constructed from several parameters obtained during sample collection in order to narrow the scope of identification targets. Fast-learning neural networks are then employed to classify the anuran species based on the feature set chosen by a wrapper feature selection method. A series of experiments was conducted to measure the performance of the proposed work. Experimental results show that the recognition rate of the proposed identification system reaches up to 93.4%. The effectiveness of the proposed identification system for anuran vocalizations is thus verified.

16.
Generating prediction rules for liquefaction through data mining   Total citations: 1, self-citations: 0, citations by others: 1
Prediction of liquefaction is an important subject in geotechnical engineering. It is also a complex problem, as it depends on many different physical factors, and the relations between these factors are highly non-linear and complex. Several approaches have been proposed in the literature for modeling and predicting liquefaction. Most of these are based on classical statistical approaches and neural networks. In this paper, a new approach based on classification data mining is proposed for the first time in the literature for liquefaction prediction. The proposed approach extracts accurate classification rules from neural networks via ant colony optimization. The extracted classification rules take the form of IF–THEN rules, which can be easily understood by humans. The proposed algorithm is also compared with several other data mining algorithms. It is shown that the proposed algorithm is very effective and accurate in predicting liquefaction.

17.
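This work studies the application of the ant colony algorithm to classification tasks in data mining. In essence, the algorithm uses the foraging principle of ant colonies to search the database, selecting and optimizing a randomly generated set of rules until the database is covered by that set, thereby mining the rules hidden in the database.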

18.
Aggregation pheromone density based data clustering   Total citations: 1, self-citations: 0, citations by others: 1
Ants, bees and other social insects deposit pheromone (a type of chemical) in order to communicate with other members of their community. Pheromone that causes clumping or clustering behavior in a species and brings individuals into closer proximity is called aggregation pheromone. This article presents a new algorithm (called APC) for clustering data sets based on this property of the aggregation pheromone found in ants. An ant is placed at the location of each data point, and the ants are allowed to move in the search space to find points with higher pheromone density. The movement of an ant is governed by the amount of pheromone deposited at different points of the search space. The more pheromone is deposited, the greater the aggregation of ants. This leads to the formation of homogeneous groups of data. The proposed algorithm is evaluated on a number of well-known benchmark data sets using different cluster validity measures. Results are compared with those obtained using two popular standard clustering techniques, namely the average linkage agglomerative and k-means clustering algorithms, and with an ant-based method called adaptive time-dependent transporter ants for clustering (ATTA-C). Experimental results demonstrate the potential of the proposed APC algorithm in terms of both solution (clustering) quality and execution time compared with the other algorithms on a large number of data sets.

19.
Preprocessing the data to filter out redundant and irrelevant features is one of the most important steps in the data mining process. Careful feature selection may improve both the computational time of inducing subsequent models and the quality of those models. Using fewer features often leads to simpler and easier-to-interpret models, and selecting important features can lead to important insights into the application. The feature selection problem is inherently a combinatorial optimization problem. This paper builds on a metaheuristic called the nested partitions method, which has been shown to be particularly effective for the feature selection problem. Specifically, we focus on the scalability of the method and show that its performance is vastly improved by incorporating random sampling of instances. Furthermore, we develop an adaptive variant of the algorithm that dynamically determines the required sample rate. The adaptive algorithm is shown to perform very well when applied to a set of standard test problems.

20.
Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space but can also ensure that the classifier trained on the reduced set performs similarly to or better than the baseline classifier trained on the original set. However, although there are numerous instance selection algorithms, no single one is the best across datasets from various problem domains. In other words, instance selection performance depends on both the algorithm and the dataset. One main reason for this is that it is very hard to define what the outliers are across different datasets. It should be noted that, with a specific instance selection algorithm, over-selection may occur by filtering out too many ‘good’ data samples, which leads to the classifier performing worse than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with the potential drawback of over-selection. Specifically, after instance selection is performed over a given training set, two classifiers are trained, one on the ‘good’ set and one on the ‘noisy’ set identified by the instance selection algorithm. Then, a test sample is compared for similarity with the data in the good and noisy sets. This comparison routes the test sample to one of the two classifiers. The experiments are conducted using 50 small-scale and 4 large-scale datasets, and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.
