Similar Literature
20 similar records found (search time: 390 ms)
1.
Feature selection is a key preprocessing technique for machine learning and data mining tasks. Traditional greedy feature selection methods consider only the best feature in each round, so the resulting feature subset is merely a local optimum, and an optimal or near-optimal feature set cannot be obtained. Evolutionary search explores the feature space effectively, but each evolutionary algorithm has its own limitations during the search. This paper combines the strengths of the genetic algorithm (GA) and particle swarm optimization (PSO), uses an information-entropy measure as the evaluation criterion, and obtains the final feature subset through co-evolution. A bit-rate crossover operator and an information-exchange strategy tailored to the feature selection problem are also proposed. Experimental results show that the co-evolutionary GA-PSO outperforms either evolutionary search alone, both in its ability to search for feature subsets and on concrete classification tasks, and that the combinatorial judgment provided by evolutionary search surpasses greedy feature selection.
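The information-entropy evaluation the abstract mentions (not the GA-PSO co-evolution itself) can be sketched as the information gain of a candidate feature; the toy data below are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# Toy data: the first feature predicts the label perfectly, the second is noise.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y  = [0, 0, 1, 1]
print(information_gain(f1, y))  # 1.0 (perfectly informative)
print(information_gain(f2, y))  # 0.0 (uninformative)
```

An evolutionary search would use such a score as (part of) the fitness of each candidate subset.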

2.
Feature selection has long been an important problem in machine learning and data mining. In multi-label learning, each sample in the dataset is associated with multiple labels, and the labels are usually correlated with one another. To reduce feature dimensionality and improve classification performance in high-dimensional multi-label data analysis, researchers have proposed multi-label feature selection methods. This paper systematically surveys progress in multi-label feature selection. After introducing multi-label classification and its evaluation criteria, it analyzes the three categories of multi-label feature selection methods in detail, namely filter, wrapper, and embedded algorithms, and outlines directions for future research.

3.
During the last decade, the deluge of multimedia data has impacted a wide range of research areas, including multimedia retrieval, 3D tracking, database management, data mining, machine learning, social media analysis, medical imaging, and so on. Machine learning is largely involved in multimedia applications of building models for classification and regression tasks, and the learning principle consists in designing the models based on the information contained in the multimedia dataset. While many paradigms exist and are widely used in the context of machine learning, most of them suffer from the ‘curse of dimensionality’, which means that strange phenomena appear when data are represented in a high-dimensional space. Given the high dimensionality and the high complexity of multimedia data, it is important to investigate new machine learning algorithms to facilitate multimedia data analysis. To deal with the impact of high dimensionality, an intuitive way is to reduce the dimensionality; other researchers have instead devoted themselves to designing effective learning schemes for high-dimensional data. In this survey, we cover feature transformation, feature selection and feature encoding, three approaches fighting the consequences of the curse of dimensionality. Next, we briefly introduce some recent progress on effective learning algorithms. Finally, promising future trends in multimedia learning are envisaged.

4.

One of the major challenges in cyberspace and Internet of Things (IoT) environments is the existence of fake or phishing websites that steal users’ information. A website, as a multimedia system, provides access to different types of data such as text, image, video, and audio, and each of these data types is prone to being exploited by phishers to mount a phishing attack. In phishing attacks, people are directed to fake pages and their important information is stolen by a thief or phisher. Machine learning and data mining algorithms are widely used for classifying websites and detecting phishing attacks. Classification accuracy is highly dependent on the feature selection method employed to choose appropriate features for classification. In this research, an improved spotted hyena optimization algorithm (ISHO algorithm) is proposed to select proper features for classifying phishing websites with a support vector machine. The proposed ISHO algorithm achieved better accuracy than the standard spotted hyena optimization algorithm. In addition, the results indicate the superiority of the ISHO algorithm over three other meta-heuristic algorithms: particle swarm optimization, the firefly algorithm, and the bat algorithm. The proposed algorithm is also compared with a number of classification algorithms previously evaluated on the same dataset.


5.
Feature selection is an important preprocessing step in machine learning and data mining, and feature selection for class-imbalanced data is an active research topic in machine learning and pattern recognition. Most traditional feature selection and classification algorithms pursue high accuracy and assume that misclassifications carry no cost or equal costs. In real applications, however, different misclassifications often incur different costs. To obtain the feature subset with the minimum misclassification cost, this paper proposes a cost-sensitive feature selection algorithm based on sample-neighborhood preservation. Its core idea is to introduce sample neighborhoods into an existing cost-sensitive feature selection framework. Experimental results on eight real-world datasets demonstrate the superiority of the algorithm.
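The cost-sensitive criterion underlying this abstract can be illustrated with a toy sketch (the cost values and names are hypothetical, not from the paper): instead of counting errors, each kind of mistake is weighted by its cost.

```python
# Hypothetical cost matrix: cost[true_class][predicted_class].
COST = {
    0: {0: 0.0, 1: 1.0},   # false positive costs 1
    1: {0: 5.0, 1: 0.0},   # false negative costs 5 (e.g. a missed diagnosis)
}

def total_cost(y_true, y_pred, cost=COST):
    """Sum of per-sample misclassification costs."""
    return sum(cost[t][p] for t, p in zip(y_true, y_pred))

y_true = [1, 1, 0, 0]
print(total_cost(y_true, [1, 0, 0, 1]))  # one FN (5) + one FP (1) = 6.0
print(total_cost(y_true, [1, 1, 0, 0]))  # perfect predictions: 0.0
```

A cost-sensitive selector would prefer the feature subset minimizing this total cost rather than the raw error count.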

6.
Movement-related potentials (MRPs) have complex mechanisms and variable forms, which makes feature extraction and data mining on MRP-based EEG signals challenging. The goal of this paper is to apply multiple machine learning and semantic-paradigm models to EEG data mining in order to meet this challenge. Several machine learning algorithms and signal processing methods are analyzed and compared experimentally, and the best model is given for different scenarios and objectives. To seamlessly connect the domains of fuzzy electrophysiological signals, deep learning compatible with multiple signal types, and explicit semantic models, a semantic-paradigm framework for EEG data is implemented, endowing complex signals with lexical, syntactic, and semantic meaning and building a semantic interpretation for deep neural networks. With this framework, information blocks carrying specific semantics in EEG signals, and the semantic combinations among them, can be identified, and efficient filters can be learned automatically, achieving high accuracy, high transmission throughput, and strong generality.

7.
While there is an ample amount of medical information available for data mining, many of the datasets are unfortunately incomplete, missing relevant values needed by many machine learning algorithms. Several approaches have been proposed for the imputation of missing values, using various reasoning steps to provide estimations from the observed data. One of the important steps in data mining is data preprocessing, where unrepresentative data is filtered out of the data to be mined. However, none of the related studies about missing value imputation consider performing a data preprocessing step before imputation. Therefore, the aim of this study is to examine the effect of two preprocessing steps, feature and instance selection, on missing value imputation. Specifically, eight different medical-related datasets are used, containing categorical, numerical and mixed types of data. Our experimental results show that imputation after instance selection can produce better classification performance than imputation alone. In addition, we show that imputation after feature selection does not have a positive impact on the imputation result.
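The pipeline order the study examines (instance selection first, imputation second) can be sketched as follows; the filtering rule and mean imputation below are simplified stand-ins, not the study's exact procedures:

```python
def select_instances(rows, max_missing=1):
    """Instance selection stand-in: drop rows with too many missing fields."""
    return [r for r in rows if sum(v is None for v in r) <= max_missing]

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) / sum(v is not None for v in c)
             for c in cols]
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

data = [
    [1.0, 2.0],
    [3.0, None],
    [None, None],   # mostly missing: removed before imputation
]
clean = mean_impute(select_instances(data))
print(clean)  # [[1.0, 2.0], [3.0, 2.0]]
```

Note that filtering first changes the column statistics the imputer sees, which is exactly why the ordering matters.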

8.
Feature selection is a perennial problem in data mining, machine learning, and pattern recognition. To address the selection bias of traditional information gain when class and feature distributions are imbalanced, this paper proposes a feature selection algorithm based on the information gain ratio and random forests. Combining the advantages of the filter and wrapper paradigms, the algorithm first measures each feature comprehensively in terms of information relevance and discriminative power, then selects features with a sequential forward selection (SFS) strategy, using classification accuracy to evaluate candidate feature subsets and thereby obtain the optimal subset. Experimental results show that the algorithm not only reduces the dimensionality of the feature space but also effectively improves the classification performance and recall of the learning algorithm.
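The sequential forward selection (SFS) strategy this abstract describes can be sketched as follows; the scoring function below is an illustrative stand-in, whereas the paper scores subsets by actual classification accuracy:

```python
def sfs(features, score, k):
    """Greedily add the feature that most improves the score until k are chosen."""
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Illustrative score: a made-up per-feature utility with a redundancy penalty.
UTILITY = {"a": 0.9, "b": 0.8, "c": 0.3}
REDUNDANT = {("a", "b")}

def score(subset):
    s = sum(UTILITY[f] for f in subset)
    for i in subset:
        for j in subset:
            if (i, j) in REDUNDANT:
                s -= 0.7  # penalize picking both of a correlated pair
    return s

print(sfs(["a", "b", "c"], score, 2))  # ['a', 'c']: 'b' is redundant with 'a'
```

The example shows why SFS with a subset-level score can skip an individually strong but redundant feature, the behavior a pure filter ranking would miss.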

9.
Improving the accuracy of machine learning algorithms is vital in designing high-performance computer-aided diagnosis (CADx) systems. Research has shown that a base classifier's performance may be enhanced by ensemble classification strategies. In this study, we construct rotation forest (RF) ensemble classifiers of 30 machine learning algorithms to evaluate their classification performance on Parkinson's, diabetes and heart disease datasets from the literature. In the experiments, first the feature dimension of the three datasets is reduced using the correlation-based feature selection (CFS) algorithm. Second, classification performances of the 30 machine learning algorithms are calculated for the three datasets. Third, 30 classifier ensembles are constructed based on the RF algorithm to assess the performance of the respective classifiers on the same disease data. All experiments are carried out with a leave-one-out validation strategy, and the performances of the 60 algorithms are evaluated using three metrics: classification accuracy (ACC), kappa error (KE) and area under the receiver operating characteristic (ROC) curve (AUC). Base classifiers achieved 72.15%, 77.52% and 84.43% average accuracies for the diabetes, heart and Parkinson's datasets, respectively. The RF classifier ensembles produced average accuracies of 74.47%, 80.49% and 87.13% for the respective diseases. RF, a newly proposed classifier ensemble algorithm, may be used to improve the accuracy of miscellaneous machine learning algorithms in designing advanced CADx systems.

10.
As a data preprocessing step, feature selection plays an important role in data mining, pattern recognition, and machine learning. It reduces problem complexity and improves the predictive accuracy, robustness, and interpretability of learning algorithms. This paper introduces the general framework of feature selection methods, focusing on the two processes of subset generation and subset evaluation; categorizes feature selection algorithms according to how they are coupled with the learning algorithm and analyzes the strengths and weaknesses of each approach; and discusses open problems in existing feature selection algorithms, along with research challenges and future directions.

11.
Credit risk assessment has been a crucial issue, as it forecasts whether an individual will default on a loan or not. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have been found to be very useful and accurate in credit risk prediction and correct decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper first introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from a pool of features; consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The Bolasso-shortlisted features are then applied to various classification algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbors (K-NN), to test their predictive accuracy. It is observed that the Bolasso-enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on a 70:30 training/test partition of three datasets (Lending Club's peer-to-peer dataset, Kaggle's Bank loan status dataset and the German credit dataset obtained from UCI). The performance of the Bolasso-enabled classification algorithms is then compared with that of other baseline feature selection methods, namely Chi Square, Gain Ratio and ReliefF, and with stand-alone classifiers (no feature selection applied). The experimental results show that Bolasso provides phenomenal stability of features compared with the other algorithms; the Jaccard Stability Measure (JSM) is used to assess the stability of the feature selection methods. Moreover, BS-RF achieves good classification accuracy and outperforms the other methods in terms of AUC and accuracy, effectively improving the decision-making process of lenders.
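The Bolasso consensus idea, keeping only features that survive every bootstrap run, can be sketched as follows; the base selector here is a crude covariance scorer rather than the Lasso the paper uses, and the data are illustrative:

```python
import random

def top_k_by_score(X, y, k):
    """Indices of the k features whose (unnormalized) covariance with y is largest."""
    def score(j):
        col = [row[j] for row in X]
        mx, my = sum(col) / len(col), sum(y) / len(y)
        return abs(sum((a - mx) * (b - my) for a, b in zip(col, y)))
    return set(sorted(range(len(X[0])), key=score, reverse=True)[:k])

def bolasso_like(X, y, k, n_boot=20, seed=0):
    """Intersect the selections made on n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    consensus = None
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]           # bootstrap resample
        sel = top_k_by_score([X[i] for i in idx], [y[i] for i in idx], k)
        consensus = sel if consensus is None else consensus & sel
    return consensus

y = [0, 0, 1, 1, 0, 1, 0, 1]
X = [[t, 1.0, 0.0] for t in y]   # feature 0 tracks y; features 1-2 are constant
print(bolasso_like(X, y, k=1))   # {0}: only the informative feature survives
```

Intersecting across resamples is what makes the selection "consistent": a feature picked by chance on one resample is unlikely to be picked on all of them.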

12.
Financial decisions are often based on classification models which are used to assign a set of observations into predefined groups. Such models ought to be as accurate as possible. One important step towards the development of accurate financial classification models involves the selection of the appropriate independent variables (features) which are relevant for the problem at hand. This is known as the feature selection problem in the machine learning/data mining field. In financial decisions, feature selection is often based on the subjective judgment of the experts. Nevertheless, automated feature selection algorithms could be of great help to the decision-makers providing the means to explore efficiently the solution space. This study uses two nature-inspired methods, namely ant colony optimization and particle swarm optimization, for this problem. The modelling context is developed and the performance of the methods is tested in two financial classification tasks, involving credit risk assessment and audit qualifications.

13.
Big data technologies are valued for processing a huge variety of data in a short time and with a large storage capacity, and users' access over the internet generates massive amounts of data to be processed. Big data therefore require intelligent feature selection models that can handle a wide variety of data; traditional feature selection techniques are only applicable to simple data mining, and intelligent techniques are needed for efficient classification in big data processing and machine learning. Most feature selection algorithms read the input features as they are, then preprocess and classify them without considering the relatedness among variables, and thus achieve a less optimal solution. In the proposed research, we focus on feature selection using a supervised learning technique, grey wolf optimization with decomposed random differential grouping (DrnDG-GWO). First, features are decomposed into subsets based on the relatedness among variables; random differential grouping is performed using the fitness values of two variables. Every subset is then treated as a population in the GWO technique. The combination of supervised machine learning with swarm intelligence techniques produces the best feature optimization results in this research. Once the features are optimized, an advanced kNN procedure is used for accurate data classification. The results of DrnDG-GWO are compared with those of standard GWO and GWO with PSO for feature selection to assess the efficiency of the proposed algorithm. The accuracy and time complexity of the proposed algorithm are 98% and 5 s, better than the existing techniques.

14.
Feature selection (attribute reduction) from large-scale incomplete data is a challenging problem in areas such as pattern recognition, machine learning and data mining. In rough set theory, feature selection from incomplete data aims to retain the discriminatory power of the original features. Many feature selection algorithms have been proposed to address this issue; however, these algorithms are often computationally time-consuming. To overcome this shortcoming, we introduce in this paper a theoretic framework based on rough set theory, called positive approximation, which can be used to accelerate a heuristic process of feature selection from incomplete data. As an application of the proposed accelerator, a general feature selection algorithm is designed. By integrating the accelerator into a heuristic algorithm, we obtain several modified representative heuristic feature selection algorithms in rough set theory. Experiments show that these modified algorithms outperform their original counterparts. It is worth noting that the performance advantage of the modified algorithms becomes more visible when dealing with larger data sets.

15.
Classification is a key problem in machine learning/data mining. Algorithms for classification have the ability to predict the class of a new instance after having been trained on data representing past experience in classifying instances. However, the presence of a large number of features in training data can hurt the classification capacity of a machine learning algorithm. The Feature Selection problem involves discovering a subset of features such that a classifier built only with this subset would attain predictive accuracy no worse than a classifier built from the entire set of features. Several algorithms have been proposed to solve this problem. In this paper we discuss how parallelism can be used to improve the performance of feature selection algorithms. In particular, we present, discuss and evaluate a coarse-grained parallel version of the feature selection algorithm FortalFS. This algorithm performs well compared with other solutions and it has certain characteristics that make it a good candidate for parallelization. Our parallel design is based on the master-slave design pattern. Promising results show that this approach is able to achieve near optimum speedups in the context of Amdahl's Law.
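The master-slave pattern the paper's parallel design is based on can be sketched with a Python thread pool; this is a toy stand-in, not FortalFS itself, and the evaluation function, feature names and scores are all illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(subset):
    """Stand-in for 'train a classifier on `subset` and return its accuracy'."""
    base = {"f1": 0.60, "f2": 0.55, "f3": 0.40}
    # Made-up utility with diminishing returns for larger subsets.
    return round(sum(base[f] for f in subset) / (1 + 0.5 * (len(subset) - 1)), 4)

candidates = [["f1"], ["f2"], ["f1", "f2"], ["f1", "f3"], ["f1", "f2", "f3"]]

# The master farms one candidate subset out to each slave and gathers results.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, candidates))

best = max(zip(scores, candidates))
print(best)  # (0.775, ['f1', 'f2', 'f3'])
```

Because each subset evaluation is independent, this is the coarse-grained, embarrassingly parallel structure that yields near-optimal speedups under Amdahl's Law.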

16.
“Dimensionality” is one of the major problems which affect the quality of the learning process in most machine learning and data mining tasks. Training a classification model on high-dimensional datasets may lead to overfitting of the learned model to the training data, which reduces generalization and therefore causes poor classification accuracy on new test instances. Another disadvantage of high dimensionality is the high CPU time required for training and testing the model. Applying feature selection to the dataset before the learning process is essential to improve the performance of the classification task. In this study, a new hybrid method which combines the artificial bee colony optimization technique with the differential evolution algorithm is proposed for feature selection in classification tasks. The developed hybrid method is evaluated on fifteen datasets from the UCI Repository which are commonly used in classification problems. For a complete evaluation, the proposed hybrid feature selection method is compared with artificial bee colony optimization and differential evolution based feature selection methods, as well as with the three most popular feature selection techniques: information gain, chi-square, and correlation feature selection. In addition, the performance of the proposed method is compared with studies in the literature that use the same datasets. The experimental results show that the developed hybrid method is able to select good features for classification tasks, improving the run-time performance and accuracy of the classifier. The proposed hybrid method may also be applied to other search and optimization problems, as its performance for feature selection is better than pure artificial bee colony optimization and differential evolution.

17.
In the past few years, the bottleneck for machine learning developers has no longer been the limited data available but the algorithms' inability to use all the data in the available time. For this reason, researchers are now interested not only in the accuracy but also in the scalability of machine learning algorithms. To deal with large-scale databases, feature selection can be helpful to reduce their dimensionality, turning an impracticable algorithm into a practical one. In this research, the influence of several feature selection methods on the scalability of four of the most well-known training algorithms for feedforward artificial neural networks (ANNs) is analyzed over both classification and regression tasks. The results demonstrate that feature selection is an effective tool to improve scalability.

18.
Mining and utilizing coal resources play an influential role in economic development. In this regard, feature information extraction is studied to accurately and efficiently support production arrangement and deployment in the mining area. First, the detection capability of Hyperspectral Remote Sensing Image (HRSI) technology is analyzed. It has high spectral resolution and many bands, and specific bands can be extracted as needed to highlight target features. According to the characteristics of HRSIs, spectral and spatial information are jointly utilized, and a Convolutional Neural Network (CNN) based on deep learning is employed for feature extraction. A CNN allows the machine to automatically obtain data features through learning and to guide feature classification. Taking the Liuyuan research area in Gansu as an example, three CNN models are used to extract and classify the ground features in the area. The VGG-19 model provides the highest overall classification accuracy, 87.3%; the VGG-16 model achieves the highest accuracy for ground features in the mining area, 95.2%; and the ResNet model performs best on road classification. Then, lithology classification is applied based on Thermal Airborne Hyperspectral Imager (TASI) data. The noise level of the first 20 bands is comparatively stable; afterward, it increases exponentially, showing a higher noise level, and the spectral curve of the data after denoising becomes smoother. The end-member extraction method is employed to extract 25 end-member spectra of almost all lithologies in the research area from the image. Similarity-coefficient clustering analysis is employed to group the curves, which are divided into six categories in total.
The separability of similar categories can be constrained by the objective function using the dictionary learning method, and the accuracy of the sparse representation of the category spectra can be improved. The spectral matching method is used to subdivide each group of mapping results, suggesting that in the research area granite is the most widely distributed, followed by diorite, andesite, and quartzite. Deep learning algorithms are applied to extract ground feature information, which is of great significance to safe production in the mining area. A hyperspectral remote sensing rock and mineral thematic information extraction module is developed, which preliminarily realizes the quantitative acquisition and high-precision identification of typical mineral information and provides technical support for research on remote sensing geological evaluation technology for resource exploration in the new era.

19.
Dimensionality reduction is an important and challenging task in machine learning and data mining. Feature selection and feature extraction are two commonly used techniques for decreasing the dimensionality of the data and increasing the efficiency of learning algorithms. In particular, feature selection performed in the absence of class labels, namely unsupervised feature selection, is challenging and interesting. In this paper, we propose a new unsupervised feature selection criterion developed from the viewpoint of subspace learning, which is treated as a matrix factorization problem. The advantages of this work are four-fold. First, dwelling on the technique of matrix factorization, a unified framework is established for feature selection, feature extraction and clustering. Second, an iterative update algorithm is provided via matrix factorization, which is an efficient technique for dealing with high-dimensional data. Third, an effective method for feature selection with numeric data is put forward, instead of relying on a discretization process. Fourth, the new criterion provides a sound foundation for embedding kernel tricks into feature selection; in this regard, an algorithm based on kernel methods is also proposed. The algorithms are compared with four state-of-the-art feature selection methods using six publicly available datasets. Experimental results demonstrate that, in terms of clustering results, the proposed two algorithms perform better than the others for almost all datasets we experimented with here.

20.
With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences that they express in the form of weight scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm. Alexandros Kalousis received the B.Sc. degree in computer science, in 1994, and the M.Sc. degree in advanced information systems, in 1997, both from the University of Athens, Greece. He received the Ph.D. degree in meta-learning for classification algorithm selection from the University of Geneva, Department of Computer Science, Geneva, in 2002. Since then he has been a Senior Researcher at the same university. His research interests include relational learning with kernels and distances, stability of feature selection algorithms, and feature extraction from spectral data. Julien Prados is a Ph.D. student at the University of Geneva, Switzerland. In 1999 and 2001, he received the B.Sc. and M.Sc. degrees in computer science from the University Joseph Fourier (Grenoble, France).
After a year of work in industry, he joined the Geneva Artificial Intelligence Laboratory, where he is working on bioinformatics and data mining tools for mass spectrometry data analysis. Melanie Hilario has a Ph.D. in computer science from the University of Paris VI and currently works at the University of Geneva's Artificial Intelligence Laboratory. She has initiated and participated in several European research projects on neuro-symbolic integration, meta-learning, and biological text mining. She has served on the program committees of many conferences and workshops in machine learning, data mining, and artificial intelligence. She is currently an Associate Editor of the International Journal on Artificial Intelligence Tools and a member of the Editorial Board of the Intelligent Data Analysis journal.
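One family of measures the paper examines, subset-level stability, can be illustrated by averaging pairwise Jaccard similarity over the subsets selected on different training-set variations; the subsets below are made up:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two feature subsets."""
    return len(a & b) / len(a | b)

def stability(subsets):
    """Mean pairwise Jaccard similarity over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

stable_runs   = [{"f1", "f2"}, {"f1", "f2"}, {"f1", "f2", "f3"}]
unstable_runs = [{"f1", "f2"}, {"f3", "f4"}, {"f5", "f6"}]
print(stability(stable_runs))    # ~0.778: runs mostly agree
print(stability(unstable_runs))  # 0.0: no overlap between any two runs
```

A selector whose subsets barely overlap across resamples may still classify well, but its "stability profile" warns that the selected features should not be interpreted as the important ones.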
