Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
    
In many domains, important events are not represented as the common scenario, but as deviations from the rule. The importance and impact associated with these particular, outnumbered, deviant, and sometimes even previously unseen events is directly related to the application domain (e.g., breast cancer detection, satellite image classification, etc.). The detection of these rare events or outliers has recently been gaining popularity as evidenced by the wide variety of algorithms currently available. These algorithms are based on different assumptions about what constitutes an outlier, a characteristic pointing toward their integration in an ensemble to improve their individual detection rate. However, there are two factors that limit the use of current ensemble outlier detection approaches: first, in most cases, outliers are not detectable in full dimensionality, but instead are located in specific subspaces of data; and second, despite the expected improvement in detection rate achieved using an ensemble of detectors, the computational cost of the ensemble grows linearly as the number of components increases. In this article, we propose an ensemble approach that identifies outliers based on different subsets of features and subsamples of data, providing more robust results while improving the computational efficiency of similar ensemble outlier detection approaches.
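The subspace-plus-subsample idea above can be illustrated with a short sketch (not the authors' implementation): each ensemble member scores outliers on a random feature subset after being fitted to a random subsample, and the rank-normalized scores are averaged. LocalOutlierFactor and all parameter choices here are illustrative assumptions.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.neighbors import LocalOutlierFactor

def subspace_subsample_outlier_scores(X, n_members=20, feat_frac=0.5,
                                      sample_frac=0.5, seed=0):
    """Average rank-normalized outlier scores over random subspaces/subsamples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_members):
        feats = rng.choice(d, size=max(2, int(feat_frac * d)), replace=False)
        rows = rng.choice(n, size=int(sample_frac * n), replace=False)
        # each member sees only its own subsample and feature subset
        # (the subsample is assumed to be larger than n_neighbors)
        lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X[np.ix_(rows, feats)])
        member_scores = -lof.score_samples(X[:, feats])  # higher = more outlying
        scores += rankdata(member_scores) / n            # rank-normalize before averaging
    return scores / n_members
```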

2.
A new ordering criterion based on margin classification capability is proposed for classifier selection based on ordered aggregation (OA). To characterize a classifier's classification capability, a random reference classifier is used to simulate the original classifier, yielding a probabilistic model of that capability. To improve ensemble performance, the proposed margin-capability ordering criterion is combined with dynamic ensemble selection: the feature space is first partitioned into regions of differing capability, an optimal classifier ensemble is then constructed within each partition, and finally a dynamic ensemble selection algorithm classifies unseen samples. Experiments on UCI datasets show that the margin-capability ordering criterion outperforms existing ordering criteria; further experiments show that the resulting dynamic ensemble selection algorithm achieves higher classification accuracy, smaller ensemble size, and shorter classification time than existing classifier ensemble algorithms.
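As a rough illustration of ordered aggregation (using plain sub-ensemble validation accuracy in place of the paper's margin-capability criterion based on a random reference classifier), a pool of fitted classifiers can be greedily ordered and the best-performing prefix kept:

```python
import numpy as np

def ordered_aggregation(pool, X_val, y_val):
    # pool: fitted classifiers; y_val: integer labels 0..K-1 (an assumption).
    # Greedily order by sub-ensemble validation accuracy and return the best prefix.
    preds = {id(c): c.predict(X_val) for c in pool}
    remaining, order = list(pool), []
    votes = np.zeros((len(X_val), 0), dtype=int)
    best_prefix, best_acc = [], -1.0
    while remaining:
        scores = []
        for c in remaining:
            stacked = np.hstack([votes, preds[id(c)][:, None]])
            maj = np.apply_along_axis(lambda r: np.bincount(r).argmax(), 1, stacked)
            scores.append(np.mean(maj == y_val))
        i = int(np.argmax(scores))
        votes = np.hstack([votes, preds[id(remaining[i])][:, None]])
        order.append(remaining.pop(i))
        if scores[i] > best_acc:
            best_acc, best_prefix = scores[i], list(order)
    return best_prefix
```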

3.
杜政霖  李云 《计算机应用》2017,37(3):866-870
For the new application scenario in which both historical data and streaming features are present, an online feature selection algorithm based on group feature selection and streaming features is proposed. In the group feature selection stage on historical data, clustering ensembles are introduced to compensate for the weaknesses of any single clustering algorithm: k-means is run multiple times to obtain a collection of clusterings, which is then consolidated with hierarchical clustering to produce the final result. In the online feature selection stage on streaming features, the feature groups produced by group construction are updated by examining correlations among features, and the final feature subset is obtained through group transformation. Experimental results show that the proposed algorithm handles online feature selection in this new scenario effectively and achieves good classification performance.
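The clustering-ensemble step can be sketched as follows: repeated k-means runs build a co-association matrix over the features, which hierarchical clustering then consolidates into final feature groups. Function names and parameters are illustrative, not the paper's; metric="precomputed" assumes scikit-learn 1.2 or later.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def cluster_features_by_consensus(feature_rows, n_runs=10, k=5, n_groups=5, seed=0):
    # feature_rows: one row per feature (e.g., the transposed data matrix),
    # since the group-construction stage groups features, not samples.
    n = feature_rows.shape[0]
    co = np.zeros((n, n))
    for r in range(n_runs):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed + r).fit_predict(feature_rows)
        co += (labels[:, None] == labels[None, :])       # co-association counts
    co /= n_runs
    # consolidate the clustering collection: hierarchical clustering on 1 - co-association
    hc = AgglomerativeClustering(n_clusters=n_groups, metric="precomputed",
                                 linkage="average")
    return hc.fit_predict(1.0 - co)
```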

4.
向欣  陆歌皓 《计算机应用研究》2021,38(12):3604-3610
To address class imbalance and cost sensitivity in real-world credit scoring and to reduce the misclassification loss of credit risk assessment, a credit scoring ensemble model based on DESMID-AD dynamic selection is proposed, which dynamically selects suitable base classifiers for each test sample according to its characteristics. To improve the model's ability to identify bad-credit customers (the minority class), oversampling is applied to balance the training data before the base classifiers are trained; base classifiers are evaluated on multiple metrics in a meta-learning fashion, and a weighting mechanism is designed at this stage to strengthen the influence of the minority class. On three public credit scoring datasets, using AUC, type I and type II error rates, and misclassification cost as evaluation metrics, comparison with nine commonly used credit scoring models demonstrates the effectiveness and feasibility of the method in the credit scoring domain.
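A simplified stand-in for the dynamic-selection step (not DESMID-AD itself): the training data is rebalanced before the pool is trained, and at prediction time the base classifier with the best minority-weighted accuracy in the test sample's neighborhood is chosen. The neighborhood size, the weighting, and the label coding (1 = bad credit) are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import NearestNeighbors
from imblearn.over_sampling import SMOTE

def fit_ds_credit(pool, X_tr, y_tr, k=7, seed=0):
    # rebalance before training the pool, as the abstract describes
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    fitted = [clone(c).fit(X_bal, y_bal) for c in pool]
    nn = NearestNeighbors(n_neighbors=k).fit(X_tr)   # region finder for dynamic selection
    return fitted, nn, X_tr, y_tr

def predict_ds(model, x):
    fitted, nn, X_tr, y_tr = model
    _, idx = nn.kneighbors(x.reshape(1, -1))
    region_X, region_y = X_tr[idx[0]], y_tr[idx[0]]
    # pick the classifier with the best local accuracy, weighting minority hits
    # more heavily -- a stand-in for the paper's multi-metric meta-evaluation
    w = np.where(region_y == 1, 2.0, 1.0)
    local = [np.average(c.predict(region_X) == region_y, weights=w) for c in fitted]
    return fitted[int(np.argmax(local))].predict(x.reshape(1, -1))[0]
```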

5.
Research on Feature Selection Methods and Algorithms   (total citations: 1; self-citations: 0; citations by others: 1)
The main idea of feature selection is to choose a feature subset by removing features that carry little or irrelevant information. Feature selection methods fall into three broad categories: filter, wrapper, and embedded. Given the large number of feature selection algorithms now available, criteria are needed for deciding which algorithm to use in a particular situation. The main work of this paper is to survey the basic feature selection algorithms, compare and classify feature selection methods and algorithms according to theoretical and experimental results in the literature, and then propose a criterion that can be relied on for this decision.
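For orientation, one hedged scikit-learn example of each family (the dataset and all parameter choices are arbitrary): a filter ranks features with a model-free score, a wrapper searches subsets using a learner's cross-validated performance, and an embedded method selects during model training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       SequentialFeatureSelector, SelectFromModel)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features by a model-free score (mutual information).
filt = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: search feature subsets using a learner's validation performance.
wrap = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=10,
                                 direction="forward", cv=5).fit(X, y)

# Embedded: selection happens inside training (the L1 penalty zeroes weights).
emb = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear",
                                         C=0.5)).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum(), emb.get_support().sum())
```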

6.
To address the problem that constraint-score-based feature selection is sensitive to the composition and cardinality of the pairwise constraints, a dynamic ensemble selection algorithm based on bagging constraint score (BCS-DES) is proposed. The algorithm introduces the bagging constraint score (BCS) into dynamic ensemble selection: the sample space is partitioned into different regions, and a multi-population parallel genetic algorithm selects a locally optimal classifier ensemble for each test sample, thereby improving classification accuracy. Experiments on UCI datasets show that BCS-DES is less affected by the composition and cardinality of pairwise constraints than existing feature selection algorithms and performs better.

7.
Learning from noisy data is a challenging task for data mining research. In this paper, we argue that for noisy data both the global bagging strategy and the local bagging strategy suffer from their own inherent disadvantages and thus cannot form accurate prediction models. Consequently, we present a Global and Local Bagging (called Glocal Bagging: GB) approach to tackle this problem. GB assigns weight values to the base classifiers under the consideration that: (1) for each test instance Ix, GB prefers bags close to Ix, which is the nature of the local learning strategy; (2) for base classifiers, GB assigns larger weight values to the ones with higher accuracy on the out-of-bag data, which is the nature of the global learning strategy. Combining (1) and (2), GB assigns large weight values to the classifiers which are close to the current test instance Ix and have high out-of-bag accuracy. The diversity/accuracy analysis on synthetic datasets shows that GB improves the classifier ensemble's performance by increasing its base classifiers' accuracy. Moreover, the bias/variance analysis also shows that GB's accuracy improvement mainly comes from the reduction of the bias error. Experimental results on 25 UCI benchmark datasets show that when the datasets are noisy, GB is superior to other previously proposed bagging methods such as classical bagging, bragging, nice bagging, trimmed bagging and lazy bagging.
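A minimal sketch of the weighting idea (my simplification, not the authors' code): each bag's "location" is approximated by its centroid, and a classifier's weight for a test instance combines its out-of-bag accuracy (the global part) with the closeness of its bag to the instance (the local part).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_glocal_bagging(X, y, n_bags=25, seed=0):
    rng = np.random.default_rng(seed)
    n, members = len(X), []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                      # bootstrap bag
        oob = np.setdiff1d(np.arange(n), idx)                 # out-of-bag rows
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        oob_acc = clf.score(X[oob], y[oob]) if len(oob) else 0.5
        members.append((clf, X[idx].mean(axis=0), oob_acc))   # centroid as bag "location"
    return members

def predict_glocal(members, x, classes):
    votes = {c: 0.0 for c in classes}
    for clf, centroid, oob_acc in members:
        closeness = 1.0 / (1.0 + np.linalg.norm(x - centroid))  # local part
        weight = closeness * oob_acc                             # global part
        votes[clf.predict(x.reshape(1, -1))[0]] += weight
    return max(votes, key=votes.get)
```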

8.
    
Credit scoring focuses on the development of empirical models to support the financial decision‐making processes of financial institutions and credit industries. It makes use of applicants' historical data and statistical or machine learning techniques to assess the risk associated with an applicant. However, the historical data may consist of redundant and noisy features that affect the performance of credit scoring models. The main focus of this paper is to develop a hybrid model, combining feature selection and a multilayer ensemble classifier framework, to improve the predictive performance of credit scoring. The proposed hybrid credit scoring model is modeled in three phases. The initial phase constitutes preprocessing and assigns ranks and weights to classifiers. In the next phase, the ensemble feature selection approach is applied to the preprocessed dataset. Finally, in the last phase, the dataset with the selected features is used in a multilayer ensemble classifier framework. In addition, a classifier placement algorithm based on the Choquet integral value is designed, as the classifier placement affects the predictive performance of the ensemble framework. The proposed hybrid credit scoring model is validated on real‐world credit scoring datasets, namely, Australian, Japanese, German‐categorical, and German‐numerical datasets.

9.
With the rapid growth of Internet finance and electronic payment, personal credit problems are increasing accordingly. Personal credit prediction is essentially an imbalanced sequential binary classification problem, characterized by large sample sizes, high dimensionality, and highly imbalanced class distributions. To distinguish applicants' credit status efficiently, this paper proposes a personal credit prediction method based on feature optimization and ensemble learning (PL-SmoteBoost). The method builds the credit prediction model within the Boosting ensemble framework: the Pearson correlation coefficient is first used in an initial analysis to remove redundant data; Lasso then selects a subset of features to reduce dimensionality and mitigate the risks of high dimension; SMOTE oversampling performs linear interpolation on the minority class of the reduced data to address class imbalance. To verify the effectiveness of the algorithm, commonly used binary classification algorithms are taken as baselines and tested on high-dimensional imbalanced datasets downloaded from Kaggle and the Microsoft open database, with AUC as the evaluation metric and statistical tests applied to the results. The results show that the proposed PL-SmoteBoost algorithm has a significant advantage over the other algorithms.
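A hedged sketch of a PL-SmoteBoost-style pipeline as described above; the correlation threshold, the L1-penalized logistic regression standing in for Lasso, and AdaBoost as the boosting component are assumptions rather than the authors' exact settings.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from imblearn.over_sampling import SMOTE

def pl_smoteboost_like(X, y, corr_thresh=0.95):
    # 1) Pearson step: drop one feature from each highly correlated pair.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = [i for i in range(X.shape[1])
            if not any(corr[i, j] > corr_thresh for j in range(i))]
    X = X[:, keep]
    # 2) Lasso-style selection via an L1-penalized logistic regression.
    selector = SelectFromModel(LogisticRegression(penalty="l1",
                                                  solver="liblinear", C=0.1))
    X = selector.fit_transform(X, y)
    # 3) SMOTE rebalances the minority class by linear interpolation.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    # 4) Boosting on the balanced, reduced data.
    model = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_res, y_res)
    return keep, selector, model
```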

10.
李云 《微型机与应用》2012,31(15):1-2,5
Feature selection is one of the key problems in machine learning and data mining, and the stability of feature selection is currently a research hotspot. This paper analyzes the factors that affect feature selection stability and the measures used to quantify it, and describes in detail two classical methods for improving the stability of feature selection.
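One widely used stability measure in this setting is the average pairwise Jaccard similarity of the feature subsets selected on perturbed (e.g., bootstrapped) versions of the data; a minimal sketch:

```python
import numpy as np
from itertools import combinations

def jaccard_stability(selected_subsets):
    # selected_subsets: list of sets of selected feature indices, one per run
    pairs = list(combinations(selected_subsets, 2))
    sims = [len(a & b) / len(a | b) for a, b in pairs if a | b]
    return float(np.mean(sims)) if sims else 1.0
```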

11.
    
Credit risk assessment has been a crucial issue as it forecasts whether an individual will default on a loan or not. Classifying an applicant as a good or bad debtor helps the lender make a wise decision. Modern data mining and machine learning techniques have been found to be very useful and accurate in credit risk prediction and correct decision making. Classification is one of the most widely used techniques in machine learning. To increase the prediction accuracy of standalone classifiers while keeping overall cost to a minimum, feature selection techniques have been utilized, as feature selection removes redundant and irrelevant attributes from the dataset. This paper initially introduces Bolasso (Bootstrap-Lasso), which selects consistent and relevant features from a pool of features. Consistent feature selection is defined as the robustness of the selected features with respect to changes in the dataset. The Bolasso-generated shortlisted features are then applied to various classification algorithms like Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB) and K-Nearest Neighbors (K-NN) to test their predictive accuracy. It is observed that the Bolasso-enabled Random Forest algorithm (BS-RF) provides the best results for credit risk evaluation. The classifiers are built on training and test data partitions (70:30) of three datasets (Lending Club's peer-to-peer dataset, Kaggle's Bank loan status dataset and the German credit dataset obtained from UCI). The performance of the Bolasso-enabled classification algorithms is then compared with that of other baseline feature selection methods like Chi Square, Gain Ratio, ReliefF and stand-alone classifiers (no feature selection method applied). The experimental results show that Bolasso provides phenomenal stability of features when compared with the stability of the other algorithms. The Jaccard Stability Measure (JSM) is used to assess the stability of the feature selection methods. Moreover, BS-RF has good classification accuracy and is better than the other methods in terms of AUC and accuracy, effectively improving the decision-making process of lenders.
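The core of Bolasso can be sketched in a few lines: Lasso is run on bootstrap resamples and only features selected in (nearly) every run are kept. The regularization strength and the selection-frequency threshold below are illustrative assumptions, and Lasso on 0/1 labels is a simplification.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_select(X, y, n_boot=50, alpha=0.01, freq=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros(d)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                        # bootstrap resample
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        counts += (coef != 0)                                    # which features survived
    return np.where(counts >= freq * n_boot)[0]                  # intersection when freq=1.0
```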

12.
To address the low classification accuracy and poor feature interpretability of traditional credit scoring methods, a personal credit scoring model based on sparse Bayesian learning (SBLCredit) is proposed. SBLCredit exploits the strengths of sparse Bayesian learning by solving under a prior imposed on the feature weights that pushes them toward sparsity, thereby performing credit scoring and feature selection simultaneously. On real German and Australian credit datasets, SBLCredit improves classification accuracy by an average of 4.52%, 6.40%, 6.26%, and 2.27% over traditional k-nearest neighbors, naive Bayes, decision tree, and support vector machine methods, respectively. Experimental results show that SBLCredit achieves high accuracy with few selected features and is an effective method for personal credit scoring.
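As a hedged illustration of the sparse-Bayesian idea (not SBLCredit itself), an automatic-relevance-determination prior shrinks irrelevant feature weights toward zero, so prediction and feature selection happen in one model; here scikit-learn's ARDRegression on ±1-coded labels stands in for the paper's classifier, and the dataset is synthetic.

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           random_state=0)
ard = ARDRegression().fit(X, 2 * y - 1)          # regress the +/-1-coded label
pred = (ard.predict(X) > 0).astype(int)          # threshold at zero for a class decision
kept = int(np.sum(np.abs(ard.coef_) > 1e-3))     # effectively nonzero feature weights
print(f"training accuracy {np.mean(pred == y):.3f} with {kept} active features")
```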

13.
Embedded Feature Selection for Network Traffic Based on a Selective Ensemble Strategy   (total citations: 1; self-citations: 0; citations by others: 1)
Machine learning for network traffic classification suffers from reliance on a single feature selection metric, class imbalance, and concept drift, which increase model complexity and reduce generalization. This paper proposes an embedded feature selection method based on a selective ensemble strategy: a subset of feature selectors is chosen for the ensemble according to the selective ensemble strategy, and an improved combination of sequential forward search and a wrapper then performs a second search for the optimal feature subset. Experimental results show that the algorithm effectively reduces the complexity of the feature subset while preserving classification performance, achieving an optimal balance of classification quality, efficiency, and stability.

14.
The number of mobile malware samples is growing explosively and variants emerge constantly; ever-larger signature bases make sample processing difficult for security vendors, and traditional detection methods can no longer handle big data of software behavior samples in a timely and effective manner. Machine-learning-based mobile malware detection faces the problems of large feature counts, low detection accuracy, and imbalanced data. To address these problems, this paper proposes a feature selection method based on mean and variance to remove features that are useless for classification, and implements ensemble classification methods based on different feature extraction techniques, including principal component analysis, the Karhunen-Loeve transform, and independent component analysis, to improve detection accuracy. For the imbalanced sample data, a multi-level classification ensemble model based on decision trees is proposed. Experimental results show that all three proposed detection methods effectively detect malware samples on the Android platform, improving accuracy by 6.41%, 3.96%, and 3.36%, respectively.
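A hedged reading of a "mean and variance" filter: score each feature by how far its class means are separated relative to its within-class variance (a Fisher-score-like criterion; the paper's exact formula may differ).

```python
import numpy as np

def mean_variance_score(X, y):
    # higher score = class means far apart relative to within-class variance
    classes = np.unique(y)
    mu = X.mean(axis=0)
    num = sum((y == c).mean() * (X[y == c].mean(axis=0) - mu) ** 2 for c in classes)
    den = sum((y == c).mean() * X[y == c].var(axis=0) for c in classes) + 1e-12
    return num / den

# e.g., keep the top 100 features by this score:
# top = np.argsort(mean_variance_score(X, y))[::-1][:100]
```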

15.
For complex data with many features, a clustering ensemble that uses sub-datasets as the input of its clustering members and aggregates them by weighted voting can trade off the quality of the different members and improve clustering accuracy and stability. Regarding how the sub-datasets are chosen and how the weights are computed, a minimal-correlation-feature method for selecting sub-datasets is proposed, and five weighting schemes for the clustering members are compared based on an analysis of feature relationships. Experimental results show that using the minimal-correlation-feature method to select each member's input data improves the accuracy of the clustering ensemble compared with random sampling. Ensembles based on all five weighting schemes are more accurate than a single clustering, with clear differences in time consumption.

16.
    
Feature selection is a process aimed at filtering out unrepresentative features from a given dataset, usually allowing the later data mining and analysis steps to produce better results. However, different feature selection algorithms use different criteria to select representative features, making it difficult to find the best algorithm for different domain datasets. The limitations of single feature selection methods can be overcome by the application of ensemble methods, combining multiple feature selection results. In the literature, feature selection algorithms are classified as filter, wrapper, or embedded techniques. However, to the best of our knowledge, there has been no study focusing on combining these three types of techniques to produce ensemble feature selection. Therefore, the aim here is to answer the question as to which combination of different types of feature selection algorithms offers the best performance for different types of medical data including categorical, numerical, and mixed data types. The experimental results show that a combination of filter (i.e., principal component analysis) and wrapper (i.e., genetic algorithms) techniques by the union method is a better choice, providing relatively high classification accuracy and a reasonably good feature reduction rate.

17.
何劲松 《计算机学报》2007,30(2):168-175
Allowing a nonzero empirical risk is what distinguishes modern pattern classifier construction methods from traditional ones. To further study the deeper effects of this change of viewpoint on pattern classification systems and to expand their learning space, the author discusses the limits on classification performance imposed by traditional systems that require the empirical risk to be zero, analyzes the key factors affecting the classification performance of pattern classification systems, gives a necessary condition for the learning space to be expandable, and constructs a speculative learning method that proves a sufficient condition for expandability. In experiments it is also observed that classifier evaluation and the classification risk on the test set are not consistently monotonic, a sobering conclusion for research on pattern recognition and its applications.

18.
Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.
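The comparison can be re-created approximately with scikit-learn parts (CART standing in for C4.5, extremely randomized trees for split randomization; the label-noise injection used in the paper is omitted here, and the dataset is an arbitrary choice):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)
models = {
    "randomized trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "bagging":  BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                  random_state=0),
    "boosting": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                   n_estimators=50, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```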

19.
Combining multiple classifiers is an effective way to solve complex pattern recognition problems. This paper proposes a new two-layer multi-classifier combination algorithm: several distinct fusion schemes are first built from the primary and secondary features of the objects to be classified, and a final combination decision is then made over these fusion schemes. Experimental results show that the algorithm achieves a high recognition rate on complex classification problems.
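A rough analogue of a two-layer combination is stacking: several first-layer classifiers are combined by a second-layer decision model. The paper's construction of fusion schemes from primary and secondary features is not reproduced here, and the estimators and dataset below are arbitrary choices.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
first_layer = [("rf", RandomForestClassifier(random_state=0)),
               ("svm", SVC(probability=True, random_state=0)),
               ("nb", GaussianNB())]
two_layer = StackingClassifier(estimators=first_layer,
                               final_estimator=LogisticRegression(max_iter=1000))
print(cross_val_score(two_layer, X, y, cv=5).mean())
```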

20.
    
Working as an ensemble method that establishes a committee of classifiers first and then aggregates their outcomes through majority voting, bagging has attracted considerable research interest and been applied in various application domains. It has demonstrated several advantages, but in its present form, bagging has been found to be less accurate than some other ensemble methods. To unlock its power and expand its user base, we propose an approach that improves bagging through the use of multi-algorithm ensembles. In a multi-algorithm ensemble, multiple classification algorithms are employed. Starting from a study of the nature of diversity, we show that compared to using different training sets alone, using heterogeneous algorithms together with different training sets increases diversity in ensembles, and hence we provide a fundamental explanation for research utilizing heterogeneous algorithms. In addition, we partially address the problem of the relationship between diversity and accuracy by providing a non-linear function that describes the relationship between diversity and correlation. Furthermore, after realizing that the bootstrap procedure is the exclusive source of diversity in bagging, we use heterogeneity as another source of diversity and propose an approach utilizing heterogeneous algorithms in bagging. For evaluation, we consider several benchmark data sets from various application domains. The results indicate that, in terms of F1-measure, our approach outperforms most of the other state-of-the-art ensemble methods considered in experiments and, in terms of mean margin, our approach is superior to all the others considered in experiments.
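A minimal sketch of heterogeneous-algorithm bagging in the spirit of the abstract (the algorithm pool and bag count are assumptions): each bootstrap bag is paired with a randomly drawn base algorithm, and predictions are combined by plurality vote.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

ALGORITHMS = [DecisionTreeClassifier(), GaussianNB(),
              KNeighborsClassifier(), LogisticRegression(max_iter=1000)]

def fit_hetero_bagging(X, y, n_bags=30, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap bag
        base = ALGORITHMS[rng.integers(0, len(ALGORITHMS))]   # heterogeneous algorithm
        members.append(clone(base).fit(X[idx], y[idx]))
    return members

def predict_hetero_bagging(members, X):
    preds = np.array([m.predict(X) for m in members])         # (n_bags, n_samples)
    # plurality vote per column (assumes integer class labels 0..K-1)
    return np.array([np.bincount(col).argmax() for col in preds.T])
```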


