Similar Documents
A total of 20 similar documents were found (search time: 62 ms).
1.
To address the problem of instance selection for big data, an instance selection algorithm based on random forest (RF) and a voting mechanism is proposed. First, the big dataset is divided into two subsets: the first is large, and the second is small or medium-sized. The first (large) subset is then partitioned into q smaller subsets, which are deployed to q cloud-computing nodes, while the second (small or medium-sized) subset is broadcast to all q nodes. Next, each node trains a random forest on its local data subset and uses the forest to select instances from the second subset; the instances selected at the individual nodes are merged to form the subset selected in this round. This process is repeated p times, yielding p instance subsets. Finally, these p subsets vote to produce the finally selected instance subset. The proposed algorithm was implemented on both the Hadoop and Spark big-data platforms, and the implementation mechanisms of the two platforms were compared. In addition, the proposed algorithm was compared with the condensed nearest neighbor (CNN) and reduced nearest neighbor (RNN) algorithms on six large datasets. The experimental results show that the larger the dataset, the higher the test accuracy and the lower the time cost of the proposed algorithm relative to these two baselines, demonstrating that it generalizes well, runs efficiently, and can effectively solve the instance selection problem for big data.
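As a rough illustration of the selection-and-voting core, here is a single-machine sketch in Python (the Hadoop/Spark distribution is omitted). The abstract does not state the per-forest selection criterion, so keeping the instances the forest classifies with the lowest confidence is an assumption of this sketch; all function and parameter names (select_instances, keep_ratio, votes_needed) are hypothetical.

```python
# Sketch of the RF + voting instance selection core (single machine).
# Assumption: each forest keeps instances it classifies with low confidence.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_instances(X_big, y_big, X_mid, y_mid, q=4, p=5,
                     keep_ratio=0.3, votes_needed=3):
    n_mid = len(X_mid)
    votes = np.zeros(n_mid, dtype=int)
    for _ in range(p):                                 # p independent rounds
        order = np.random.permutation(len(X_big))
        chosen = np.zeros(n_mid, dtype=bool)
        for part in np.array_split(order, q):          # the q "nodes"
            rf = RandomForestClassifier(n_estimators=50)
            rf.fit(X_big[part], y_big[part])
            conf = rf.predict_proba(X_mid).max(axis=1)
            k = int(keep_ratio * n_mid)
            chosen[np.argsort(conf)[:k]] = True        # low confidence = informative
        votes += chosen                                # merge this round's selection
    mask = votes >= votes_needed                       # final vote over the p rounds
    return X_mid[mask], y_mid[mask]
```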

2.
A genetic algorithm-based method for feature subset selection
As a commonly used technique in data preprocessing, feature selection selects a subset of informative attributes or variables to build models describing data. By removing redundant, irrelevant or noisy features, feature selection can improve the predictive accuracy and the comprehensibility of the predictors or classifiers. Many feature selection algorithms with different selection criteria have been introduced by researchers. However, no single criterion has proved best for all applications. In this paper, we propose a framework based on a genetic algorithm (GA) for feature subset selection that combines various existing feature selection methods. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for a particular inductive learning algorithm of interest for building the classifier. We conducted experiments using three data sets and three existing feature selection methods. The experimental results demonstrate that our approach is robust and effective at finding subsets of features with higher classification accuracy and/or smaller size than those of each individual feature selection algorithm.
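A minimal sketch of a GA wrapper of this kind, assuming a bitmask chromosome per feature subset and a fitness that combines cross-validated accuracy of the target learner (a k-NN here) with a small size penalty; the paper's actual operators and its way of combining multiple selection criteria are not reproduced, and all names are illustrative.

```python
# Minimal GA for feature subset selection with bitmask chromosomes.
# Fitness: cross-validated accuracy of the target learner, minus a small
# penalty on subset size (one plausible combination of the stated goals).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def ga_feature_selection(X, y, pop=20, gens=30, p_mut=0.05):
    rng = np.random.default_rng(0)
    d = X.shape[1]
    P = rng.random((pop, d)) < 0.5                     # random initial population

    def fitness(mask):
        if not mask.any():
            return -1.0
        acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
        return acc - 0.01 * mask.sum() / d             # prefer small subsets

    for _ in range(gens):
        scores = np.array([fitness(m) for m in P])
        parents = P[np.argsort(scores)[-pop // 2:]]    # truncation selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, d)                   # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(d) < p_mut             # bit-flip mutation
            children.append(child)
        P = np.vstack([parents, children])
    return P[np.argmax([fitness(m) for m in P])]       # boolean feature mask
```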

3.
Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly being produced in many fields of research. Although current algorithms are useful for fairly large datasets, scaling problems are found when the number of instances is in the hundreds of thousands or millions. When we face huge problems, scalability becomes an issue, and most algorithms are not applicable. Thus, paradoxically, instance selection algorithms are for the most part impracticable for the same problems that would benefit most from their use. This paper presents a way of avoiding this difficulty using several rounds of instance selection on subsets of the original dataset. These rounds are combined using a voting scheme to allow good performance in terms of testing error and storage reduction, while the execution time of the process is significantly reduced. The method is particularly efficient when we use instance selection algorithms that are high in computational cost. The proposed approach shares the philosophy underlying the construction of ensembles of classifiers. In an ensemble, several weak learners are combined to form a strong classifier; in our method several weak (in the sense that they are applied to subsets of the data) instance selection algorithms are combined to produce a strong and fast instance selection method. An extensive comparison of 30 medium and large datasets from the UCI Machine Learning Repository using 3 different classifiers shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets (from three hundred thousand to more than a million instances) with good results and fast execution time.
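The round-and-vote scheme can be sketched independently of the base instance selection algorithm. In this hypothetical wrapper, base_selector is any function that returns the indices it keeps within a partition, and the vote threshold is a free parameter:

```python
# Voting wrapper: run a (possibly expensive) instance selector on small
# random partitions for several rounds, then keep the instances that were
# selected in at least `votes_needed` rounds.  Names are illustrative.
import numpy as np

def voted_selection(X, y, base_selector, rounds=10, n_parts=20, votes_needed=5):
    n = len(X)
    votes = np.zeros(n, dtype=int)
    for _ in range(rounds):
        order = np.random.permutation(n)
        for part in np.array_split(order, n_parts):
            kept = base_selector(X[part], y[part])   # local indices kept
            votes[part[kept]] += 1                   # credit the global indices
    return np.where(votes >= votes_needed)[0]        # surviving instances
```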

4.
The suitability of an optimisation algorithm selected from within an algorithm portfolio depends upon the features of the particular instance to be solved. Understanding the relative strengths and weaknesses of different algorithms in the portfolio is crucial for effective performance prediction, automated algorithm selection, and for generating knowledge about the ideal conditions for each algorithm, which can inform better algorithm design. Relying on well-studied benchmark instances, or randomly generated instances, limits our ability to truly challenge each of the algorithms in a portfolio and determine these ideal conditions. Instead we use an evolutionary algorithm to evolve instances that are uniquely easy or hard for each algorithm, thus providing a more direct method for studying the relative strengths and weaknesses of each algorithm. The proposed methodology ensures that the meta-data is sufficient to be able to learn the features of the instances that uniquely characterise the ideal conditions for each algorithm. A case study is presented based on a comprehensive study of the performance of two heuristics on the Travelling Salesman Problem. The results show that prediction of search effort as well as the best performing algorithm for a given instance can be achieved with high accuracy.
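The core of the methodology is the fitness that drives instance evolution. A minimal sketch, assuming run_a and run_b return a performance measure (e.g., search effort) of each algorithm on an instance; maximizing the gap evolves instances that are uniquely easy for one algorithm and hard for the other:

```python
# Hypothetical fitness for evolving instances that discriminate between
# two algorithms: reward a large performance gap.  `run_a` and `run_b`
# return a cost measure (e.g., search effort) on the given instance.
def uniqueness_fitness(instance, run_a, run_b):
    return run_b(instance) - run_a(instance)   # large gap: uniquely easy for A
```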

5.
Instance selection is becoming more and more relevant due to the huge amount of data that is being constantly produced. However, although current algorithms are useful for fairly large datasets, scaling problems are found when the number of instances is in the hundreds of thousands or millions. In the best case, these algorithms have efficiency O(n²), n being the number of instances. When we face huge problems, scalability is an issue, and most algorithms are not applicable. This paper presents a divide-and-conquer recursive approach to the problem of instance selection for instance based learning for very large problems. Our method divides the original training set into small subsets where the instance selection algorithm is applied. Then the selected instances are rejoined in a new training set and the same procedure, partitioning and application of an instance selection algorithm, is repeated. In this way, our approach is based on the philosophy of divide-and-conquer applied in a recursive manner. The proposed method is able to match, and even improve, for the case of storage reduction, the results of well-known standard algorithms with a very significant reduction of execution time. An extensive comparison on 30 datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 5 huge datasets ranging from 300,000 to more than a million instances, with very good results and fast execution time.
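A compact sketch of the recursive partition-select-rejoin loop, again treating the base instance selection algorithm as a black box; the partition size and the stopping rule (stop once a pass barely shrinks the set) are assumptions of this sketch.

```python
# Recursive divide-and-conquer instance selection: partition, select
# locally, rejoin the survivors, and repeat on the reduced set.
import numpy as np

def recursive_selection(X, y, base_selector, part_size=1000, min_shrink=0.05):
    idx = np.arange(len(X))
    while len(idx) > part_size:
        order = np.random.permutation(idx)
        n_parts = max(1, len(order) // part_size)
        survivors = []
        for part in np.array_split(order, n_parts):
            kept = base_selector(X[part], y[part])    # select within the partition
            survivors.append(part[kept])
        new_idx = np.concatenate(survivors)
        if len(new_idx) > (1 - min_shrink) * len(idx):  # barely shrinking: stop
            return new_idx
        idx = new_idx                                   # recurse on the rejoined set
    return idx
```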

6.
翟俊海  张素芳  王聪  沈矗  刘晓萌 《计算机应用》2018,38(10):2759-2763
To address the problem that traditional active learning algorithms can only handle small or medium-sized datasets, a MapReduce-based active learning algorithm for big data is proposed. First, on an initial labeled training set, a classifier is trained with the Extreme Learning Machine (ELM) algorithm, and its output is transformed into a posterior probability distribution with a softmax function. Then, the large unlabeled dataset is partitioned into l subsets, which are deployed to l cloud-computing nodes. At each node, the trained classifier computes the information entropy of the instances in its subset in parallel, and the q instances with the highest entropy are selected for labeling; the l×q labeled instances are then added to the labeled training set. These steps are repeated until a predefined stopping condition is met. The proposed algorithm was compared with an ELM-based active learning algorithm on four datasets: Artificial, Skin, Statlog and Poker. The results show that the proposed algorithm completes active instance selection on all four datasets, whereas the ELM-based active learning algorithm completes it only on the smallest one. The experimental results demonstrate that the proposed algorithm outperforms the ELM-based active learning algorithm.
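The entropy-driven query step is easy to sketch on a single machine. In this illustration a logistic regression stands in for the ELM-plus-softmax classifier, and the MapReduce partitioning becomes a plain loop over l chunks; all names are illustrative.

```python
# One active-learning round: score unlabeled instances by predictive
# entropy and query the q most uncertain ones per chunk.  A logistic
# regression substitutes for the ELM + softmax classifier of the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_round(X_lab, y_lab, X_unlab, l=4, q=10):
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    picked = []
    for chunk in np.array_split(np.arange(len(X_unlab)), l):    # the l "nodes"
        proba = clf.predict_proba(X_unlab[chunk])
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)  # information entropy
        picked.extend(chunk[np.argsort(entropy)[-q:]])          # q most uncertain
    return np.array(picked)   # l*q instances to send for labeling
```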

7.
To address the problem that many multi-instance algorithms make assumptions about the instances in positive bags, an instance selection ensemble algorithm combined with fuzzy clustering (ISFC) is proposed. Combining fuzzy clustering with the characteristics of negative bags in multi-instance learning, the concept of a "positive score" is introduced to measure the likelihood that an instance's label is positive, reducing the label ambiguity of instances in multi-instance learning. Considering that misclassifying negative instances incurs a higher cost in multi-instance learning, a strategy for selecting representative instances of bags is designed; the selected representative in…

8.
Evolutionary algorithms are adaptive methods based on natural evolution that may be used for search and optimization. As data reduction in knowledge discovery in databases (KDD) can be viewed as a search problem, it can be solved using evolutionary algorithms (EAs). In this paper, we have carried out an empirical study of the performance of four representative EA models in which we have taken into account two different instance selection perspectives, prototype selection and training set selection, for data reduction in KDD. This paper includes a comparison between these algorithms and other nonevolutionary instance selection algorithms. The results show that the evolutionary instance selection algorithms consistently outperform the nonevolutionary ones, the main advantages being better instance reduction rates, higher classification accuracy, and models that are easier to interpret.

9.
Relax-and-Cut algorithms offer an alternative to strengthen Lagrangian relaxation bounds. The main idea behind these algorithms is to dynamically select and dualize inequalities (cuts) within a Lagrangian relaxation framework. This paper proposes a Relax-and-Cut algorithm for the Set Partitioning Problem. Computational tests are reported for benchmark instances from the literature. For Set Partitioning instances with integrality gaps, a variant of the classical Lagrangian relaxation is often used in the literature. It introduces a knapsack constraint to the standard formulation of the problem. Our results indicate that the proposed Relax-and-Cut algorithm outperforms the latter approach in terms of lower bound quality. Furthermore, it turns out to be very competitive in terms of CPU time. Decisive in achieving that performance was the implementation of dominance rules to manage inequalities in the cut pool. The Relax-and-Cut framework proposed here could also be used as a preprocessing tool for Linear Integer Programming solvers. Computational experiments demonstrated that the combined use of our framework and XPRESS improved the performance of that Linear Integer Programming solver for the test sets used in this study.
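For reference, the bound being strengthened has the following textbook form (generic notation, not the paper's exact formulation): the set partitioning constraints Ax = e are dualized with free multipliers λ, while the dynamically selected cuts Dx ≤ d enter with multipliers μ ≥ 0.

```latex
% Lagrangian relaxation of min c^T x s.t. Ax = e, x binary, with the
% dynamically dualized cuts Dx <= d; z_LD is the best lower bound.
z_{LR}(\lambda,\mu) = \min_{x \in \{0,1\}^n}
  \; c^{\top}x + \lambda^{\top}(e - Ax) + \mu^{\top}(Dx - d),
\qquad
z_{LD} = \max_{\lambda,\;\mu \ge 0} z_{LR}(\lambda,\mu) \;\le\; z_{IP}.
```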

10.
Much research in the area of constraint processing has recently been focused on extracting small unsatisfiable “cores” from unsatisfiable constraint systems with the goal of finding minimal unsatisfiable subsets (MUSes). While most techniques have provided ways to find an approximation of an MUS (not necessarily minimal), we have developed a sound and complete algorithm for producing all MUSes of an unsatisfiable constraint system. In this paper, we describe a relationship between satisfiable and unsatisfiable subsets of constraints that we subsequently use as the foundation for MUS extraction algorithms, implemented for Boolean satisfiability constraints. The algorithms provide a framework with which many related subproblems can be solved, including relaxations of completeness to handle intractable instances, and we develop several variations of the basic algorithms to illustrate this. Experimental results demonstrate the performance of our algorithms, showing how the base algorithms run quickly on many instances, while the variations are valuable for producing results on instances whose complete results are intractably large. Furthermore, our algorithms are shown to perform better than the existing algorithms for solving either of the two distinct phases of our approach.

11.
The prototype selection problem consists of reducing the size of databases by removing samples that are considered noisy or not influential in nearest neighbour classification tasks. Evolutionary algorithms have been used recently for prototype selection, showing good results. However, due to the complexity of this problem when the size of the databases increases, the behaviour of evolutionary algorithms can deteriorate considerably because of a lack of convergence. This additional problem is known as the scaling up problem.

Memetic algorithms are approaches for heuristic searches in optimization problems that combine a population-based algorithm with a local search. In this paper, we propose a memetic algorithm model that incorporates an ad hoc local search specifically designed to exploit the properties of the prototype selection problem, with the aim of tackling the scaling up problem. In order to check its performance, we have carried out an empirical study including a comparison between our proposal and previous evolutionary and non-evolutionary approaches studied in the literature.

The results have been contrasted with the use of non-parametric statistical procedures and show that our approach outperforms previously studied methods, especially when the database scales up.
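A skeleton of the memetic loop under stated assumptions: individuals are boolean prototype masks, fitness is any scorer (for example one combining 1-NN accuracy and reduction), and the ad hoc refinement is approximated by a first-improvement bit-flip local search; the paper's actual neighborhood is more specialized than this.

```python
# Memetic skeleton: a genetic generation followed by local search on
# every child.  `fitness` scores a boolean prototype mask.
import numpy as np

rng = np.random.default_rng(0)

def local_search(mask, fitness, tries=50):
    """First-improvement bit-flip search: the local refinement step."""
    best = fitness(mask)
    for i in rng.permutation(len(mask))[:tries]:
        mask[i] ^= True                  # flip: add or drop one prototype
        f = fitness(mask)
        if f > best:
            best = f                     # keep an improving flip
        else:
            mask[i] ^= True              # undo otherwise
    return mask

def memetic_generation(pop, fitness):
    """Tournament selection, uniform crossover, mutation, then local
    search on each child (the memetic ingredient)."""
    scores = [fitness(m) for m in pop]
    def tournament():
        i, j = rng.integers(len(pop), size=2)
        return pop[i] if scores[i] >= scores[j] else pop[j]
    children = []
    for _ in pop:
        a, b = tournament(), tournament()
        child = np.where(rng.random(len(a)) < 0.5, a, b)   # uniform crossover
        child ^= rng.random(len(child)) < 0.01             # bit-flip mutation
        children.append(local_search(child, fitness))
    return children
```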


12.
Feature selection is a key preprocessing technique for machine learning and data mining tasks. Traditional greedy feature selection methods consider only the best feature in each round, so the obtained feature subset is merely a local optimum, and an optimal or near-optimal feature set cannot be reached. Evolutionary search explores the feature space effectively, but each evolutionary algorithm has its own limitations during the search. This paper draws on the respective strengths of the genetic algorithm (GA) and particle swarm optimization (PSO), uses an information-entropy measure as the evaluation criterion, and obtains the final feature subset through co-evolution. A bit-rate crossover operator and an information exchange strategy specific to the feature selection problem are also proposed. Experimental results show that the cooperative co-evolution of GA and PSO (GA-PSO) outperforms either evolutionary search method alone, both in its ability to search for feature subsets and on concrete classification learning tasks, and that the combinatorial judgment provided by evolutionary search is superior to greedy feature selection.
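A sketch of the cooperative loop under stated assumptions: ga_step and pso_step stand in for one GA generation and one PSO update respectively, and the information exchange is approximated by swapping each population's best individual for the other's worst every few generations. The paper's bit-rate crossover operator and entropy-based evaluation are not reproduced here.

```python
# Cooperative GA-PSO skeleton.  `ga_pop` and `swarm` are lists of
# candidate feature subsets; `fitness` scores one subset.
def coevolve(ga_pop, swarm, fitness, ga_step, pso_step, gens=50, swap_every=5):
    for g in range(gens):
        ga_pop = ga_step(ga_pop, fitness)           # one GA generation
        swarm = pso_step(swarm, fitness)            # one PSO update
        if g % swap_every == 0:                     # information exchange
            fg = [fitness(m) for m in ga_pop]
            fp = [fitness(m) for m in swarm]
            ga_pop[fg.index(min(fg))] = swarm[fp.index(max(fp))]  # PSO best in
            swarm[fp.index(min(fp))] = ga_pop[fg.index(max(fg))]  # GA best in
    pool = list(ga_pop) + list(swarm)
    scores = [fitness(m) for m in pool]
    return pool[scores.index(max(scores))]          # best subset either search found
```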

13.
One of the most powerful, popular and accurate classification techniques is support vector machines (SVMs). In this work, we want to evaluate whether the accuracy of SVMs can be further improved using training set selection (TSS), where only a subset of training instances is used to build the SVM model. By contrast to existing approaches, we focus on wrapper TSS techniques, where candidate subsets of training instances are evaluated using the SVM training accuracy. We consider five wrapper TSS strategies and show that those based on evolutionary approaches can significantly improve the accuracy of SVMs.
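The wrapper evaluation itself is simple to sketch: the fitness of a candidate subset of training instances is the accuracy of an SVM trained on it. Below, random masks stand in for the paper's evolutionary candidates; only the scoring function is the point of the sketch, and all names are illustrative.

```python
# Wrapper TSS in miniature: score a candidate training subset by the
# accuracy of an SVM trained on it (measured here on the full training set).
import numpy as np
from sklearn.svm import SVC

def tss_score(mask, X_train, y_train):
    svm = SVC().fit(X_train[mask], y_train[mask])
    return svm.score(X_train, y_train)             # wrapper fitness

def random_wrapper_tss(X_train, y_train, n_candidates=30, keep=0.5):
    rng = np.random.default_rng(0)
    best_mask, best = None, -1.0
    for _ in range(n_candidates):
        mask = rng.random(len(X_train)) < keep     # candidate training subset
        if not mask.any():
            continue
        s = tss_score(mask, X_train, y_train)
        if s > best:
            best_mask, best = mask, s
    return best_mask
```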

14.
Our confidence in the future performance of any algorithm, including optimization algorithms, depends on how carefully we select test instances so that the generalization of algorithm performance on future instances can be inferred. In recent work, we have established a methodology to generate a 2-d representation of the instance space, comprising a set of known test instances. This instance space shows the similarities and differences between the instances using measurable features or properties, and enables the performance of algorithms to be viewed across the instance space, where generalizations can be inferred. The power of this methodology is the insights that can be generated into algorithm strengths and weaknesses by examining the regions in instance space where strong performance can be expected. The representation of the instance space depends on the choice of test instances, however. In this paper we present a methodology for generating new test instances with controllable properties, by filling observed gaps in the instance space. This enables the generation of rich new sets of test instances to support better the understanding of algorithm strengths and weaknesses. The methodology is demonstrated on graph colouring as a case study.

15.
李净  郭洪禹 《计算机应用》2012,32(10):2899-2903
To address the low retrieval precision of region-based image retrieval systems, a multi-instance prototype selection algorithm that incorporates text information, together with a feedback annotation mechanism, is proposed. In prototype selection, text information is first used to expand the positive examples; initial instances are then selected by estimating the distribution of negative instances; finally, the true instance prototypes are obtained by alternately optimizing instance updating and classifier learning. Relevance feedback adopts an active learning mechanism that combines multiple strategies, with an information value controlling the automatic switching between them, so that the system can always choose the currently most suitable active learning strategy. Experimental results show that the method is effective and outperforms other methods.

16.
A reliable and precise classification of tumors is essential for successful treatment of cancer. Gene selection is an important step for improved diagnostics. In this study, a modified SFFS (sequential forward floating selection) algorithm based on the weighted Mahalanobis distance, called MSWM, is proposed to identify optimal informative gene subsets that take genes' joint discriminatory power into account for accurate discrimination. Firstly, we use the one-dimensional weighted Mahalanobis distance to perform a preliminary selection of genes; we then use the modified SFFS method and the multidimensional weighted Mahalanobis distance to obtain the optimal informative gene subset for tumor classification. Finally, we use the k nearest neighbor and naive Bayes methods to classify tumors based on the optimal gene subset selected by the MSWM method. To validate its efficiency, the proposed MSWM method is applied to classify two different DNA microarray datasets. Our empirical study shows that the MSWM method achieves better classification effectiveness than the BWR (the ratio of between-groups to within-groups sum of squares) and IVGA_I (independent variable group analysis I) methods, suggesting that the MSWM gene selection method is able to obtain correct informative gene subsets that account for genes' joint discriminatory power in tumor classification.
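For reference, the unweighted Mahalanobis distance between a sample x and the mean μ_k of class k (with class covariance Σ_k) is shown below; the abstract does not spell out MSWM's exact weighting scheme, which modifies each gene's contribution to this distance.

```latex
% Mahalanobis distance of sample x from class k (mean \mu_k, covariance
% \Sigma_k); MSWM additionally weights each gene's contribution, but the
% exact weighting is not given in the abstract.
d(x, \mu_k) = \sqrt{(x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k)}
```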

17.
This paper presents a cooperative evolutionary approach for the problem of instance selection for instance based learning. The model presented takes advantage of one of the recent paradigms in the field of evolutionary computation: cooperative coevolution. This paradigm is based on a similar approach to the philosophy of divide and conquer. In our method, the training set is divided into several subsets that are searched independently. A population of global solutions relates the search in different subsets and keeps track of the best combinations obtained. The proposed model has the advantage over standard methods in that it does not rely on any specific distance metric or classifier algorithm. Additionally, the fitness function of the individuals considers both storage requirements and classification accuracy, and the user can balance both objectives depending on his/her specific needs, assigning different weights to each one of these two terms. The method also shows good scalability when applied to large datasets. The proposed model is favorably compared with some of the most successful standard algorithms, IB3, ICF and DROP3, with a genetic algorithm using the CHC method, and with four recent methods of instance selection, MSS, entropy-based instance selection, IMOEA and LVQPRU. The comparison shows a clear advantage of the proposed algorithm in terms of storage requirements, and it is at least as good as any of the other methods in terms of testing error. A large set of 50 problems from the UCI Machine Learning Repository is used for the comparison. Additionally, a study of the effect of instance label noise is carried out, showing the robustness of the proposed algorithm.
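The two-term fitness described above admits a one-line sketch; alpha and both measures are generic placeholders, since the abstract only says that the user assigns the weights.

```python
# User-weighted fitness: classification accuracy vs. storage reduction.
# `accuracy_of` scores a boolean instance mask; `alpha` is user-chosen.
def fitness(mask, accuracy_of, alpha=0.75):
    reduction = 1.0 - mask.mean()                  # fraction of instances removed
    return alpha * accuracy_of(mask) + (1 - alpha) * reduction
```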

18.
The Synthetic Minority Oversampling Technique (SMOTE) is a widely used technique to balance imbalanced data. In this paper we focus on improving SMOTE in the presence of class noise. Many improvements of SMOTE have been proposed, mostly cleaning or improving the data after applying SMOTE. Our approach differs from these approaches in that it cleans the data before applying SMOTE, such that the quality of the generated instances is better. After applying SMOTE we also carry out data cleaning, such that instances (original or introduced by SMOTE) that fit badly in the new dataset are also removed. To this end we propose two prototype selection techniques, both based on fuzzy rough set theory. The first fuzzy rough prototype selection algorithm removes noisy instances from the imbalanced dataset; the second cleans the data generated by SMOTE. An experimental evaluation shows that our method improves existing preprocessing methods for imbalanced classification, especially in the presence of noise.
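The order of operations is the essential point and can be sketched directly. The two fuzzy rough prototype selectors are the paper's contribution, so clean_before and clean_after below are placeholders for them; the sketch assumes the imbalanced-learn package for SMOTE itself.

```python
# Pipeline order from the abstract: clean the imbalanced data first,
# apply SMOTE, then clean the augmented set again.
from imblearn.over_sampling import SMOTE

def denoised_smote(X, y, clean_before, clean_after):
    X1, y1 = clean_before(X, y)             # remove noisy originals first
    X2, y2 = SMOTE().fit_resample(X1, y1)   # oversample the cleaned data
    return clean_after(X2, y2)              # drop badly fitting instances
```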

19.
A basic algorithm for data preprocessing in KDD is proposed. For the attributes in a database, an unsupervised learning algorithm is used to obtain a task-oriented target data subset, on the basis of which a hybrid optimization algorithm selects the feature subset. The basic algorithms of the genetic algorithm and the hybrid genetic algorithm for feature subset selection are analysed, and simulation experiments demonstrate the effectiveness and feasibility of the hybrid optimization algorithm.

20.
A new improved forward floating selection (IFFS) algorithm for selecting a subset of features is presented. Our proposed algorithm improves the state-of-the-art sequential forward floating selection algorithm. The improvement is to add an additional search step called “replacing the weak feature” to check whether removing any feature in the currently selected feature subset and adding a new one at each sequential step can improve the current feature subset. Our method provides the optimal or quasi-optimal (close to optimal) solutions for many selected subsets and requires significantly less computational load than optimal feature selection algorithms. Our experimental results for four different databases demonstrate that our algorithm consistently selects better subsets than other suboptimal feature selection algorithms do, especially when the original number of features of the database is large.
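The added step can be sketched in isolation. Assuming selected and remaining are sets of feature indices and J is any subset criterion, "replacing the weak feature" tries every one-for-one swap and keeps the best improving one; names are illustrative.

```python
# The extra IFFS step: test whether swapping one selected feature for
# one unselected feature improves the criterion J, and keep the best swap.
def replace_weak_feature(selected, remaining, J):
    best_swap, best = None, J(selected)
    for f_out in list(selected):
        for f_in in remaining:
            cand = (selected - {f_out}) | {f_in}   # one-for-one swap
            if J(cand) > best:
                best_swap, best = (f_out, f_in), J(cand)
    if best_swap:
        f_out, f_in = best_swap
        selected = (selected - {f_out}) | {f_in}   # apply the best swap
    return selected
```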
