Similar literature
1.
Semantics-preserving dimensionality reduction refers to the problem of selecting those input features that are most predictive of a given outcome, a problem encountered in many areas such as machine learning, pattern recognition, and signal processing. This has found successful application in tasks that involve data sets containing huge numbers of features (on the order of tens of thousands), which would otherwise be impossible to process further. Recent examples include text processing and Web content classification. One of the many successful applications of rough set theory has been to this feature selection area. This paper reviews those techniques that preserve the underlying semantics of the data, using crisp and fuzzy rough set-based methodologies. Several approaches to feature selection based on rough set theory are experimentally compared. Additionally, a new area in feature selection, feature grouping, is highlighted and a rough set-based feature grouping technique is detailed.
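As a rough illustration of crisp rough set feature selection, the sketch below computes the dependency degree of the decision on a set of attributes (the share of rows whose equivalence class is label-pure) and grows a subset greedily, in the style commonly called QuickReduct. This is an assumption made for illustration and not necessarily one of the methods the review compares; the function names and the toy data layout are hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Dependency degree: fraction of rows lying in label-pure equivalence classes."""
    positive = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def quickreduct(rows, labels):
    """Greedily add the attribute that raises the dependency degree the most."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    while dependency(rows, labels, reduct) < target:
        candidates = [a for a in all_attrs if a not in reduct]
        best = max(candidates, key=lambda a: dependency(rows, labels, reduct + [a]))
        reduct.append(best)
    return reduct

# Hypothetical toy table: attribute 0 alone determines the label.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = [0, 0, 1, 1]
reduct = quickreduct(rows, labels)
```

Because the selected subset consists of original attributes rather than transformed ones, the meaning of each retained feature is preserved, which is the sense of "semantics-preserving" used in the abstract.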

2.
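Feature selection aims to reduce the dimensionality of the data to be processed and to eliminate redundant features, and it is one of the key problems in machine learning. Existing semi-supervised feature selection methods generally rely on graph models to extract the clustering structure of a data set, but the extracted clustering structure lacks clear boundaries, which degrades feature selection. To address this, a semi-supervised feature selection method based on sparse graph representation is proposed. It builds a joint learning model of clustering structure and feature selection, constrains the graph model with the l_1 norm to obtain a clear clustering structure, and introduces the l_2,1 norm to avoid interference from noise and improve the accuracy of feature selection. To verify the effectiveness of the method, several currently popular feature selection methods are chosen for comparative analysis, and the experimental results demonstrate its effectiveness.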

3.
Reducing the dimensionality of the data has been a challenging task in data mining and machine learning applications. In these applications, the existence of irrelevant and redundant features negatively affects the efficiency and effectiveness of different learning algorithms. Feature selection is one of the dimension reduction techniques, which has been used to allow a better understanding of data and improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection in the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection that measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison with the state-of-the-art methods for unsupervised feature selection.
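The reconstruction-error criterion can be sketched naively as follows. The sketch assumes the error is the squared Frobenius norm of the residual after projecting the data matrix onto the span of the selected columns, and it recomputes that error from scratch with least squares for every candidate. It therefore does not use the paper's recursive formula, and the function names are hypothetical.

```python
import numpy as np

def reconstruction_error(A, cols):
    """Squared Frobenius error of reconstructing A from the columns in cols."""
    S = A[:, cols]
    # Least-squares coefficients that best rebuild every column of A from S.
    coef, *_ = np.linalg.lstsq(S, A, rcond=None)
    return float(np.linalg.norm(A - S @ coef, "fro") ** 2)

def greedy_select(A, k):
    """Pick k columns, each time adding the one that lowers the error the most."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(A.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: reconstruction_error(A, selected + [j]))
        selected.append(best)
    return selected

# Hypothetical rank-2 example: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
chosen = greedy_select(A, 2)
```

Recomputing the projection for every candidate costs far more than an incremental update, which is presumably why the abstract stresses an efficient recursive formula.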

4.
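The development of big data places higher demands on classification accuracy in data classification. The wide use of the support vector machine (SVM) calls for an efficient way to construct an SVM classifier with strong classification ability. The kernel function parameter, the penalty factor, and the feature subset of an SVM strongly influence the complexity and prediction accuracy of the model. To improve SVM classification performance, this paper integrates the asymptotic behavior of the SVM into the Grey Wolf Optimization (GWO) algorithm and proposes a new SVM classifier model. The model optimizes the SVM parameters and the feature subset of the data simultaneously; new grey wolf individuals that incorporate the SVM asymptotic behavior steer the GWO search toward the best region of the hyperparameter space, so the optimal solution is obtained faster. In addition, a new fitness function is proposed that combines the classification accuracy, the number of selected features, and the number of support vectors; together with the asymptotics-aware GWO, it guides the search toward the optimal solution. The model is validated on several classic UCI data sets and compared with grid search, GWO without the asymptotic behavior, and methods from other literature; its classification accuracy improves to varying degrees on the different data sets. The experimental results show that the proposed algorithm finds the optimal SVM parameters and the smallest feature subset, with higher classification accuracy and shorter average processing time.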
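The abstract does not give the form of its fitness function. A minimal sketch of one way to combine the three quantities it names is a weighted sum that rewards accuracy and penalizes large feature subsets and many support vectors. The weights, the normalization, and the function name below are all assumptions made for illustration.

```python
def fitness(accuracy, n_selected, n_features, n_support, n_samples,
            w_acc=0.8, w_feat=0.1, w_sv=0.1):
    """Hypothetical weighted fitness; higher is better.

    accuracy   : classification accuracy in [0, 1]
    n_selected : number of features kept by the candidate solution
    n_support  : number of support vectors of the trained SVM
    The weights are illustrative and are not taken from the paper.
    """
    feature_term = 1.0 - n_selected / n_features   # fewer features is better
    support_term = 1.0 - n_support / n_samples     # fewer support vectors is better
    return w_acc * accuracy + w_feat * feature_term + w_sv * support_term

# Two candidates with equal accuracy: the sparser one scores higher.
compact = fitness(0.95, n_selected=5, n_features=30, n_support=40, n_samples=200)
bulky = fitness(0.95, n_selected=20, n_features=30, n_support=90, n_samples=200)
```

A scalar score of this kind lets a single-objective optimizer such as GWO trade accuracy against model size, at the cost of having to choose the weights in advance.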

5.
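Feature selection is an important preprocessing step in machine learning and data mining, and feature selection for class-imbalanced data is a hot research topic in machine learning and pattern recognition. Most traditional feature selection and classification algorithms pursue high accuracy and assume that misclassification incurs no cost or the same cost for every error. In real applications, different misclassifications often incur different costs. To obtain the feature subset with minimum misclassification cost, this paper proposes a cost-sensitive feature selection algorithm based on sample neighborhood preservation. The core idea is to introduce sample neighborhoods into the existing cost-sensitive feature selection framework. Experimental results on eight real data sets demonstrate the superiority of the algorithm.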

6.
In the field of pattern recognition, objects are usually represented by multiple features (multimodal features). For example, to characterize a natural scene image, it is essential to extract a set of visual features representing its color, texture, and shape information. However, integrating multimodal features for recognition is challenging because: (1) each feature has its specific statistical property and physical interpretation, (2) a huge number of features may result in the curse of dimensionality (when the data dimension is high, the distances between pairwise objects in the feature space become increasingly similar due to the central limit theorem, which negatively affects recognition performance), and (3) some features may be unavailable. To solve these problems, a new multimodal feature selection algorithm, termed Grassmann manifold feature selection (GMFS), is proposed. In particular, by defining a clustering criterion, the multimodal features are transformed into a matrix and further treated as a point on the Grassmann manifold in Hamm and Lee (Grassmann discriminant analysis: a unifying view on subspace-based learning. In: Proceedings of the 25th international conference on machine learning (ICML), pp. 376–383, Helsinki, Finland [2008]). To deal with the unavailable features, the L2-Hausdorff distance, a metric between different-sized matrices, is computed and the kernel is obtained accordingly. Based on the kernel, we propose supervised/unsupervised feature selection algorithms to achieve a physically meaningful embedding of the multimodal features. Experimental results on eight data sets validate the effectiveness of the proposed approach.

7.
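With the development of Internet and Internet of Things technologies, collecting data has become increasingly easy. However, high-dimensional data contain many redundant and irrelevant features; using them directly increases the computational load of a model needlessly and can even degrade its performance, so dimensionality reduction is necessary. Feature selection lowers computational cost and removes redundant features by reducing the feature dimension, which improves the performance of machine learning models, and because it keeps the original features of the data it offers good interpretability. Feature selection has become one of the important data preprocessing steps in machine learning. Rough set theory is an effective tool for feature selection: it preserves the characteristics of the original features while removing redundant information. However, because evaluating all combinations of feature subsets is expensive, traditional rough set-based feature selection methods struggle to find the globally optimal feature subset. To address this, the paper proposes a feature selection method based on rough sets and an improved whale optimization algorithm. To keep the whale algorithm from becoming trapped in local optima, an improved whale algorithm with population optimization and perturbation strategies is proposed. The algorithm first randomly initializes a series of feature subsets, then evaluates each subset with an objective function based on the rough set attribute dependency degree, and finally applies the improved whale optimization algorithm to find an acceptable near-optimal feature subset through repeated iteration. Experimental results on UCI data sets show that, with a support vector machine as the evaluating classifier, the proposed algorithm finds feature subsets with less information loss and achieves higher classification accuracy. The proposed algorithm therefore has certain advantages in feature selection.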

8.
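Feature selection is an important step in the data preprocessing stage in fields such as machine learning, pattern recognition, and data mining. Data collected in practice are high-dimensional and contain a large amount of redundancy and noise, which both increases computation time and can mislead the modeling results. Combining the generalized importance of attribute subsets with the intelligent optimization runner-root algorithm, a feature selection algorithm is proposed that uses the runner-root algorithm for iterative optimization and uses the generalized importance of attribute subsets and ...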

9.
Zhang Fan, Du Bo, Zhang Liangpei, Zhang Lefei. 《计算机科学》 (Computer Science), 2014, 41(12): 275-279
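How to accurately identify the class information in an image is an important research problem in computer vision and pattern recognition. Remote sensing satellite imagery, and hyperspectral imagery in particular, integrates spatial and spectral information in a single data set and enriches the available sources of image information. Accurately identifying land-cover classes in hyperspectral images has become a hot topic in image processing and pattern recognition. A hyperspectral image classification method based on band-grouping features and morphological features is proposed, combining spatial and spectral features to improve classification accuracy. Experiments on real hyperspectral data show that band grouping effectively preserves spectral features and reduces data redundancy, and that classification combining morphological features on top of band grouping achieves clearly higher accuracy than traditional classification methods.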

10.
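Machine learning-based iterative compilation can effectively predict the best combination of optimization parameters for a new program during iterative compilation. In model training, existing methods suffer from inefficient search over optimization parameter combinations, inappropriate program feature representation, and low prediction accuracy. Machine learning-based iterative compilation is therefore a current research hotspot in the iterative compilation field, and its research challenges lie in learning algorithm selection, optimization parameter search, and program feature representation. A method for predicting program optimization parameters based on supervised learning is proposed. The method first searches the optimization parameter space with a constrained multi-objective particle swarm algorithm to find the best optimization parameters for the sample functions; it then extracts function features using a program feature representation that combines dynamic and static features; finally, it builds a supervised learning model from samples formed by the function features and optimization parameters and predicts the optimization parameters of new programs. Statistical models are built with the k-nearest neighbor method and softmax regression, and experimental results show that the new method achieves good prediction performance on the NPB benchmark suite and on large scientific computing programs.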

11.

In medical information systems there are many features, and the relationships among them are strong. Feature selection for medical datasets has therefore attracted great attention recently. In this article, a tolerance rough set firefly-based quick reduct is developed and applied to the problem of differential diagnosis of diseases. The hybrid intelligent framework aims to exploit the advantages of the underlying models and, at the same time, moderate their limitations. Feature selection is the process of identifying an optimal subset of the original features. Its ultimate aim is to increase the accuracy, computational efficiency, and scalability of the prediction method in machine learning, pattern recognition, and data mining applications. In this way, the learning system obtains a compact structure without reducing predictive accuracy, because it uses only the selected salient features. In this research, a hybridization of two techniques, the tolerance rough set and a recently developed meta-heuristic optimization algorithm, the firefly algorithm, is used to select the prominent features of medical data in order to classify and diagnose major diseases. The experimental results show that the proposed system outperforms the existing supervised feature selection techniques.

12.
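When machine vision is used to inspect the geometric parameters of shell-type parts positioned without fixtures, the geometric features of the part must first be recognized intelligently in order to plan the inspection path. To address this, an intelligent geometric feature recognition method based on supervised machine learning is proposed. A feature matrix is built from the relative center positions of the features to be recognized on the shell part, a supervised machine learning algorithm performs the recognition, and an error-correction method based on feature uniqueness corrects recognition errors produced during classification. In the studied example, the part has four holes to be recognized, and recognition accuracy reaches 100% after five rounds of supervised training.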

13.
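Transfer learning is integrated with the group method of data handling (GMDH) to propose a GMDH-based transfer feature selection (GMDH-TFS) model. On four UCI data sets, GMDH-TFS is compared experimentally with classification using all features (FULL) and with commonly used feature selection models: the forward supervised feature selection model (SFFS), the forward semi-supervised feature selection model (FW-SemiFS), and the transfer feature selection model (TFS). The results show that GMDH-TFS selects features better than the other four methods, and the same holds in the small-sample case. GMDH-TFS can perform feature selection when data distributions are inconsistent and still achieves good results when data are scarce.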

14.

Cancer classification is one of the main steps in the patient healing process, which pushes modern clinical researchers to use advanced bioinformatics methods for it. Cancer classification is usually performed using gene expression data obtained from microarray experiments together with advanced machine learning methods. A microarray experiment generates a huge amount of data, and processing it with machine learning methods is a big challenge. In this study, a two-step classification paradigm that merges genetic algorithm feature selection with machine learning classifiers is used. The genetic algorithm is built in the spirit of MapReduce programming, which makes it highly scalable on a Hadoop cluster. To improve its performance, the proposed algorithm is extended into a parallel algorithm that processes microarray data in a distributed manner using the Hadoop MapReduce framework. The algorithm was tested on eleven GEMS data sets (9 tumors, 11 tumors, 14 tumors, brain tumor 1, lung cancer, brain tumor 2, leukemia 1, DLBCL, leukemia 2, SRBCT, and prostate tumor), and its accuracy reached 100% with fewer than 25 selected features. The proposed cloud computing-based MapReduce parallel genetic algorithm performed well on gene expression data. In addition, the authors state that the scalability of the algorithm is unlimited because of the underlying Hadoop MapReduce platform. The presented results indicate that the proposed method can be effectively implemented for real-world microarray data in the cloud environment, and that the Hadoop MapReduce framework substantially decreases computation time.

15.
Models based on data mining and machine learning techniques have been developed to detect the disease early or assist in clinical breast cancer diagnoses. Feature selection is commonly applied to improve the performance of models. There are numerous studies on feature selection in the literature, and most of the studies focus on feature selection in supervised learning. When class labels are absent, feature selection methods in unsupervised learning are required. However, there are few studies on these methods in the literature. Our paper aims to present a hybrid intelligence model that uses the cluster analysis techniques with feature selection for analyzing clinical breast cancer diagnoses. Our model provides an option of selecting a subset of salient features for performing clustering and comprehensively considers the use of most existing models that use all the features to perform clustering. In particular, we study the methods by selecting salient features to identify clusters using a comparison of coincident quantitative measurements. When applied to benchmark breast cancer datasets, experimental results indicate that our method outperforms several benchmark filter- and wrapper-based methods in selecting features used to discover natural clusters, maximizing the between-cluster scatter and minimizing the within-cluster scatter toward a satisfactory clustering quality.

16.
A machine learning framework which uses unlabeled data from a related task domain in supervised classification tasks is described. The unlabeled data come from related domains, which share the same class labels or generative distribution as the labeled data. Patterns in the unlabeled data are learned via a neural network and transferred to the target domain from where the labeled data are generated, so as to improve the performance of the supervised learning task. We call this approach self-taught transfer learning from unlabeled data. We introduce a general-purpose feature learning algorithm producing features that retain information from the unlabeled data. Information preservation assures that the features obtained will be useful for improving the classification performance of the supervised tasks.

17.
Research on a noise-tolerant feature subset selection algorithm   Total citations: 4, self-citations: 0, citations by others: 4
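The feature subset selection problem has long been an important topic in artificial intelligence research, and in recent years feature subset selection algorithms have become a research hotspot in fields such as machine learning and data mining. A new feature subset selection algorithm is proposed, the noise-tolerant feature subset selection algorithm (NFS). It introduces the idea of clustering into noise handling and applies the Gini coefficient and the Mexican hat function to feature selection, enabling feature subset selection on noisy data sets. Experimental results in real-world domains show that NFS tolerates high noise levels, selects highly representative features, and runs quickly, so it can be applied effectively in practice.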

18.

Code smell detection is essential to improve software quality, enhance software maintainability, and decrease the risk of faults and failures in a software system. In this paper, we propose a code smell prediction approach based on machine learning techniques and software metrics. The local interpretable model-agnostic explanations (LIME) algorithm is further used to explain the machine learning model’s predictions and improve interpretability. The datasets obtained from Fontana et al. were reformed and used to build binary-label and multi-label datasets. The results of 10-fold cross-validation show that tree-based algorithms (mainly Random Forest) perform better than kernel-based and network-based algorithms. Genetic algorithm-based feature selection enhances the accuracy of these machine learning algorithms by selecting the most relevant features in each dataset. Moreover, parameter optimization based on the grid search algorithm significantly enhances the accuracy of all these algorithms. Finally, machine learning techniques have high potential for predicting code smells, which contributes to detecting these smells and enhancing software quality.

19.
Recently, multi‐ and many‐objective meta‐heuristic algorithms have received considerable attention due to their capability to solve optimization problems that require more than one fitness function. This paper presents a comprehensive study of these techniques applied in the context of machine learning problems. Three different topics are reviewed in this work: (a) feature extraction and selection, (b) hyper‐parameter optimization and model selection in the context of supervised learning, and (c) clustering or unsupervised learning. The survey also highlights future research towards related areas.

20.
Cervical cancer is one of the most serious and most frequent cancers, but it can be cured effectively if diagnosed at an early stage. This is a novel effort towards effective characterization of cervix lesions from contrast-enhanced CT scan images to provide a reliable and objective discrimination between benign and malignant lesions. The performance of such classification models mostly depends on the features used to represent samples in a training dataset. Selecting the optimal feature subset is NP-hard, and randomized algorithms do better here. In this paper, the Grey Wolf Optimizer (GWO), a population-based meta-heuristic inspired by the leadership hierarchy and hunting mechanism of grey wolves, is used for feature selection. The traditional GWO is applicable to continuous single-objective optimization problems. Since feature selection is inherently multi-objective, this paper proposes two approaches for multi-objective binary GWO algorithms: a scalarized approach to multi-objective GWO (MOGWO) and a non-dominated sorting-based GWO (NSGWO). These are used for wrapper-based feature selection that selects an optimal textural feature subset for improved classification of cervix lesions. For the experiments, contrast-enhanced CT scan (CECT) images of 62 patients were used, where all lesions had been recommended for surgical biopsy by a specialist. Gray-level co-occurrence matrix-based texture features are extracted from a two-level decomposition of wavelet coefficients of cervix regions extracted from the CECT images. The results of the proposed approaches are compared with widely used meta-heuristics for multi-objective optimization, namely the genetic algorithm (GA) and the firefly algorithm (FA). With better diversification and intensification, GWO obtains Pareto solutions that dominate the solutions obtained by GA and FA on the cervix lesion cases used. Cervix lesions are classified as benign or malignant with up to 91% accuracy using only five features selected by NSGWO. A two-tailed t-test on the mean F-score obtained by the proposed NSGWO method was conducted at a significance level of 0.05, and it confirms that NSGWO performs significantly better than the other methods on the real cervix lesion dataset at hand. Further experiments were conducted on high-dimensional microarray gene expression datasets collected online. The results demonstrate that the proposed method performs significantly better than other methods at selecting relevant genes for high-dimensional, multi-category cancer diagnosis, with an average improvement of 12.82% in F-score.
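The notion of Pareto solutions that dominate others can be made concrete with a short sketch. It assumes two objectives to be minimized for each candidate feature subset, classification error and number of selected features; this pairing and the function names are illustrative assumptions rather than the paper's exact formulation.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical candidates as (classification error, number of features).
candidates = [(0.10, 5), (0.09, 8), (0.12, 5), (0.09, 9)]
front = pareto_front(candidates)
```

Non-dominated sorting repeats this filtering to rank a whole population into successive fronts, whereas a scalarized approach collapses the objectives into one weighted score before comparing candidates.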
