Similar literature
 19 similar documents found; search took 187 ms
1.
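Zou Wei (邹薇), Wang Huijin (王会进). 《微型机与应用》 (Microcomputer & Its Applications), 2011, 30(16): 75-77, 81
Real-world applications produce many incomplete data sets, which lose information and complicate analysis, so handling missing data has become an active research topic in classification. Because the EM method picks its initial cluster centers at random, its clustering can be unstable. This paper uses the classification results of the naive Bayes algorithm to set the initial range for the EM algorithm, refines the estimates by iterating the E-step and M-step, and fills in the missing data with the resulting maximizing values. Experiments show that the algorithm makes clustering more stable and imputes data more accurately.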
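The abstract above does not state the underlying model, so the following is only a minimal sketch of the general idea of iterating E-steps and M-steps and then filling the gaps. It swaps in a bivariate Gaussian model in which `x` is always observed and `y` is sometimes missing; that model, the function name `em_impute`, and the neutral starting point are all assumptions for illustration, not the authors' naive-Bayes-initialized method.

```python
def em_impute(xs, ys, n_iter=300):
    """Fill missing ys (given as None) by EM under a bivariate Gaussian model.

    Assumption: x is fully observed, y is missing at random, and at least
    two distinct x values have an observed y.
    """
    n = len(xs)
    mu_x = sum(xs) / n
    s_xx = sum(x * x for x in xs) / n - mu_x ** 2   # fixed: x is complete

    # Initial guesses for the y-related parameters from the complete cases.
    observed = [y for y in ys if y is not None]
    mu_y = sum(observed) / len(observed)
    s_yy = sum(y * y for y in observed) / len(observed) - mu_y ** 2
    s_xy = 0.0                                      # start from "no dependence"

    for _ in range(n_iter):
        # E-step: expected y and y^2 for each record under current parameters.
        beta = s_xy / s_xx
        cond_var = s_yy - beta * s_xy               # Var(y | x)
        ey, ey2 = [], []
        for x, y in zip(xs, ys):
            if y is None:
                m = mu_y + beta * (x - mu_x)        # E[y | x]
                ey.append(m)
                ey2.append(m * m + cond_var)
            else:
                ey.append(y)
                ey2.append(y * y)
        # M-step: re-estimate parameters from the expected statistics.
        mu_y = sum(ey) / n
        s_yy = sum(ey2) / n - mu_y ** 2
        s_xy = sum(x * e for x, e in zip(xs, ey)) / n - mu_x * mu_y

    # Fill each gap with its conditional mean under the final parameters.
    beta = s_xy / s_xx
    return [y if y is not None else mu_y + beta * (x - mu_x)
            for x, y in zip(xs, ys)]


filled = em_impute([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, None, None])
```

In this toy run the observed pairs lie exactly on y = 2x + 1, so the two gaps are filled with values very close to 9 and 11. A better starting point, which is what the abstract obtains from a classifier, would mainly reduce the number of iterations needed.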

2.
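Borrowing the idea of semi-supervised classification, this paper proposes a Bayesian classification model based on an improved EM algorithm to classify the large volumes of imbalanced data with randomly missing values found in mobile communication networks. First, a preliminary statistical analysis of the real data yields prior probabilities that reflect the state of the variables to some degree; these serve as the initial values of the Bayesian classification model for iterative EM training, which reduces the number of EM iterations and mitigates EM's sensitivity to initial values and its tendency to converge to local optima. Then, the Bayesian network classification model trained on historical mobile communication data is used to classify the test data. Experiments show that the method greatly improves the prediction success rate for negative-class samples in mobile communication data and outperforms traditional statistical analysis methods.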

3.
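Research on learning Bayesian network structure with missing data (cited 40 times: 0 self-citations, 40 by others)
Wang Shuangcheng (王双成), Yuan Senmiao (苑森淼). 《软件学报》 (Journal of Software), 2004, 15(7): 1042-1048
At present, Bayesian network structure learning with missing data relies mainly on the EM algorithm and score-and-search methods, which are inefficient and easily trapped in locally optimal structures. To address these problems, a new method for learning Bayesian network structure with missing data is developed. The unobserved data are first initialized at random to obtain a complete data set, from which a maximum likelihood tree is built as the initial Bayesian network structure; iterative learning then follows. In each iteration, the unobserved data are revised using the current Bayesian network structure together with Gibbs sampling, and the structure is then adjusted on the new complete data set, based on the basic dependency relationships between variables and the idea of dependency analysis, until the structure stabilizes. The method both solves the standard Gi…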

4.
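A parallel algorithm for learning Bayesian network parameters (cited 2 times: 0 self-citations, 2 by others)
To address the computational cost of learning Bayesian network parameters with the EM algorithm on large samples, a parallel EM algorithm (Parallel EM, PL-EM) is proposed to speed up parameter learning for complex Bayesian networks. In the E-step, PL-EM computes the posterior probabilities of the hidden variables and the expected sufficient statistics in parallel; in the M-step, it exploits the conditional independence of the Bayesian network and the decomposability of the likelihood function under complete data to compute the local likelihood functions in parallel. Experiments show that PL-EM is an effective method for learning Bayesian network parameters on large samples.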
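The parallel E-step described above works because expected sufficient statistics are sums over samples, so each worker can process its own chunk and the partial sums can simply be added. The sketch below shows that structure on a toy two-coin mixture rather than a Bayesian network; the model, the equal mixing weights, and the names `e_step_chunk` and `parallel_em` are assumptions for illustration and are not taken from the PL-EM paper.

```python
from concurrent.futures import ThreadPoolExecutor


def e_step_chunk(chunk, theta_a, theta_b):
    """Expected sufficient statistics for one chunk of (heads, tails) trials."""
    stats = [0.0, 0.0, 0.0, 0.0]  # heads_A, tails_A, heads_B, tails_B
    for heads, tails in chunk:
        like_a = theta_a ** heads * (1 - theta_a) ** tails
        like_b = theta_b ** heads * (1 - theta_b) ** tails
        resp_a = like_a / (like_a + like_b)   # posterior that coin A was used
        stats[0] += resp_a * heads
        stats[1] += resp_a * tails
        stats[2] += (1 - resp_a) * heads
        stats[3] += (1 - resp_a) * tails
    return stats


def parallel_em(data, theta_a, theta_b, n_chunks=2, n_iter=20):
    """EM for two coins with unknown biases; the E-step is split over chunks."""
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        for _ in range(n_iter):
            # E-step: every chunk is processed independently.
            parts = list(pool.map(e_step_chunk, chunks,
                                  [theta_a] * n_chunks, [theta_b] * n_chunks))
            heads_a, tails_a, heads_b, tails_b = (sum(col) for col in zip(*parts))
            # M-step: each coin's update uses only its own statistics, a toy
            # analogue of the decomposable likelihood mentioned above.
            theta_a = heads_a / (heads_a + tails_a)
            theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b


trials = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
serial = parallel_em(trials, 0.6, 0.5, n_chunks=1)
chunked = parallel_em(trials, 0.6, 0.5, n_chunks=3)
```

Splitting the data changes only the order of the additions, so `serial` and `chunked` agree up to floating-point rounding. Threads are used here only to keep the sketch self-contained; in CPython a CPU-bound E-step would need processes or separate machines to gain real speed.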

5.
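Learning a Bayesian network divides into structure learning and parameter learning. The expectation-maximization (EM) algorithm is commonly used for parameter learning from incomplete data, but it is computationally expensive, converges slowly, and tends to get stuck in local maxima, so the traditional EM algorithm struggles with large data sets. This paper examines the main problems of the EM algorithm and partitions a large data set into small sample blocks for processing, which reduces the computation EM requires and also improves accuracy. Experiments show that the improved EM algorithm performs well.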

6.
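To address missing-data imputation in naive Bayes classification, a naive Bayes classification algorithm based on an improved EM (Expectation Maximization) algorithm is proposed. The algorithm first estimates each missing value from the grey relational degree and uses the estimate as the initial value for EM; after iterating the E-step and M-step to fill in the missing data, it classifies the samples with the naive Bayes algorithm. Experiments show that the improved algorithm achieves high classification accuracy. The improved algorithm is also applied to grading university teaching posts.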

7.
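When data contain missing values, the EM algorithm is usually used to learn Bayesian networks. However, EM takes the joint likelihood as its objective function, which deviates from the goal of discriminative prediction. In contrast, the CEM (Conditional Expectation Maximum) algorithm takes the conditional likelihood directly as its objective function. This paper studies the CEM algorithm for learning discriminative Bayesian networks and proposes a Q function that makes CEM monotonic and convergent. To simplify computation, the E-step of CEM uses a simplified form of the Q function, and the M-step uses the result of a single gradient-descent search as an approximation of the optimum. Finally, experiments on UCI data sets demonstrate the effectiveness of CEM for learning discriminative Bayesian networks.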

8.
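To address three constraints and limitations of the naive Bayes algorithm, a Bayesian optimization algorithm for data with missing values is proposed. The algorithm computes the grey relational degree between every pair of attributes and uses it to merge correlated attributes, delete redundant attributes, and weight attributes. It then runs an improved EM algorithm, guided by the grey relational degree, to fill in the missing data, and classifies the processed data set with the naive Bayes algorithm. Experiments confirm the effectiveness of the optimized algorithm.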
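The abstract does not give its formula for the grey relational degree. As a hedged illustration, the sketch below uses the commonly cited form from grey relational analysis, with distinguishing coefficient rho = 0.5. It assumes the two attribute sequences are already normalized to comparable scales and takes the minimum and maximum differences over this single pair only; the function name is hypothetical.

```python
def grey_relational_degree(reference, candidate, rho=0.5):
    """Grey relational degree between two equal-length numeric sequences.

    Assumption: both sequences are already normalized to comparable scales.
    """
    deltas = [abs(r - c) for r, c in zip(reference, candidate)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:
        return 1.0  # identical sequences are maximally related
    # Grey relational coefficient at each position, then their mean.
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)


degree = grey_relational_degree([1, 2, 3], [1, 2, 5])  # 7/9, about 0.778
```

A value near 1 marks a strongly related pair of attributes, which is the kind of signal the abstract uses to merge, delete, or weight attributes.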

9.
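Multi-dimensional classification assigns a data instance to classes along several dimensions based on its feature vector, and it has broad application prospects. During model learning, massive training data mean that accurate classification algorithms need very long training times. To make multi-dimensional classification more efficient while keeping prediction accuracy high, this paper proposes a multi-dimensional classification learning method based on Bayesian networks. First, the multi-dimensional classification problem is formulated as a conditional probability distribution problem. Second, a conditional tree Bayesian network model is built from the dependencies among the class vectors. Finally, the structure and parameters of the conditional tree Bayesian network model are learned from the training data set, and a multi-dimensional classification prediction algorithm is proposed. Extensive experiments on real data sets show that, compared with MMOC, the best current multi-dimensional classification algorithm, the proposed method reduces model training time by two orders of magnitude while maintaining high accuracy. The proposed method is therefore better suited to multi-dimensional classification of massive data.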

10.
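Time complexity is a bottleneck for Bayesian network learning algorithms based on the EM framework. This paper first proposes a parallel parametric EM algorithm for learning the parameters of Bayesian networks with missing data; experiments show that it effectively reduces the time complexity of parameter learning. The algorithm is then applied within structural EM to give a parallel structural EM algorithm (PL-SEM), which computes the expected sufficient statistics of each sample and the parameters of the Bayesian network in parallel, reducing the time complexity of structure learning.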

11.
On classification with incomplete data (cited 4 times: 0 self-citations, 4 by others)
We address the incomplete-data problem in which the feature vectors to be classified are missing data (features). A (supervised) logistic regression algorithm for the classification of incomplete data is developed. Single or multiple imputation for the missing data is avoided by performing analytic integration with an estimated conditional density function (conditioned on the observed data). Conditional density functions are estimated using a Gaussian mixture model (GMM), with parameter estimation performed using both expectation-maximization (EM) and variational Bayesian EM (VB-EM). The proposed supervised algorithm is then extended to the semisupervised case by incorporating graph-based regularization. The semisupervised algorithm utilizes all available data, both incomplete and complete, as well as labeled and unlabeled. Experimental results of the proposed classification algorithms are shown.

12.
Datasets are growing larger, and missing values in them pose a serious threat to data analysis. Although researchers have developed various techniques for handling missing values in different kinds of datasets, little effort has gone into missing values in mixed attributes in large datasets. This paper proposes novel strategies for dealing with this issue. The significant attributes (covariates) required for imputation are first selected using the gain ratio measure to decrease the computational complexity. Since the analysis of continuous attributes in the imputation process is complex, they are first discretized using a novel methodology called Bayesian classifier-based discretization. Then, missing values in them are imputed using a Bayesian max–min ant colony optimization algorithm, which hybridizes ACO with Bayesian principles. A local search technique is also introduced in the ACO implementation to improve its exploitative capability. The proposed methodology is applied to real datasets with missing rates ranging from 5 to 50%, and the experimental results show that the proposed discretization and imputation algorithms produce better results than the existing methods.
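The gain ratio mentioned above is the information gain of an attribute divided by its split information, which penalizes attributes with many distinct values. The following is a minimal sketch for one categorical attribute; it assumes every record has an observed value for that attribute, and the function names are illustrative rather than taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of a categorical attribute with respect to class labels."""
    n = len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Class entropy remaining after splitting on the attribute.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    # A constant attribute has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


perfect = gain_ratio(["a", "a", "b", "b"], [0, 0, 1, 1])   # 1.0
useless = gain_ratio(["a", "b", "a", "b"], [0, 0, 1, 1])   # 0.0
```

Ranking attributes by this score and keeping the top few is one plausible way to perform the covariate selection step the abstract describes.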

13.
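Missing data in wireless sensor networks harm subsequent data analysis, so preprocessing is essential before analysis begins. Sensor network data follow regular patterns in both time and space, yet existing missing-value imputation algorithms usually approach the problem from a single perspective. To exploit both the temporal and spatial dimensions, this paper proposes a missing-value imputation method based on spatio-temporal correlation. The method fills in missing data using regression fitting, an improved BP neural network, and related techniques. Experiments show that the method effectively improves imputation accuracy.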

14.
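This paper studies an improved naive Bayes text classification algorithm based on a fusion of SVM and EM, and its application to spam filtering. The naive Bayes algorithm cannot handle variations produced by feature combinations, depends heavily on the distribution of the sample space, and is inherently unstable, all of which increase its time complexity. To solve these problems, an improved naive Bayes algorithm based on SVM-EM is proposed, combining the strengths of three algorithms: the simplicity and efficiency of naive Bayes, the ability of EM to fill in missing attributes, and the support vector machine. First, a nonlinear transformation and the structural risk minimization principle turn traffic classification into a quadratic optimization problem; then the EM algorithm is used to compensate for the conditional independence assumption that naive Bayes requires; finally, naive Bayes filters the mail, improving classification accuracy and stability. Simulation results show that, compared with traditional mail filtering algorithms, the method quickly finds the optimal subset of classification features and greatly improves the accuracy and stability of spam filtering.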

15.
We present a maximum margin parameter learning algorithm for Bayesian network classifiers using a conjugate gradient (CG) method for optimization. In contrast to previous approaches, we maintain the normalization constraints on the parameters of the Bayesian network during optimization, i.e., the probabilistic interpretation of the model is not lost. This enables us to handle missing features in discriminatively optimized Bayesian networks. In experiments, we compare the classification performance of maximum margin parameter learning to conditional likelihood and maximum likelihood learning approaches. Discriminative parameter learning significantly outperforms generative maximum likelihood estimation for naive Bayes and tree augmented naive Bayes structures on all considered data sets. Furthermore, maximizing the margin dominates the conditional likelihood approach in terms of classification performance in most cases. We provide results for a recently proposed maximum margin optimization approach based on convex relaxation. While the classification results are highly similar, our CG-based optimization is computationally up to orders of magnitude faster. Margin-optimized Bayesian network classifiers achieve classification performance comparable to support vector machines (SVMs) using fewer parameters. Moreover, we show that unanticipated missing feature values during classification can be easily processed by discriminatively optimized Bayesian network classifiers, a case where discriminative classifiers usually require mechanisms to complete unknown feature values in the data first.

16.
Hanbyul Yeon, Seongbum Seo, Hyesook Son, Yun Jang. The Journal of Supercomputing, 2022, 78(2): 1759-1782

A Bayesian network is derived from conditional probability and is useful for inferring the next state of the currently observed variables. If data are lost or corrupted during collection or transfer, the characteristics of the original data may be distorted and biased. Therefore, values predicted by a Bayesian network built from data with missing values are not reliable. Various statistical and machine learning techniques have been studied to repair imperfect data, but since the complete data are unknown, there is no optimal way to impute missing values. In this paper, we present a visual analysis system that supports decision-making about how to impute missing values occurring in panel data. The visual analysis system allows data analysts to explore the cause of missing data in panel datasets. The system also enables us to compare the performance of suitable imputation models using the Bayesian network accuracy and the Kolmogorov–Smirnov test. We evaluate how the visual analysis system supports the decision-making process for data imputation with datasets from different domains.
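The Kolmogorov–Smirnov comparison mentioned above can be illustrated with the two-sample KS statistic, the largest gap between the empirical distribution functions of two samples. The sketch below computes only that statistic, with no p-value, and the idea of comparing a reference sample with values produced by an imputation model is an assumption about how such a check might be set up, not a description of the authors' system.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at data points, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


reference = [1, 2, 3, 4]          # hypothetical observed values
imputed = [3, 4, 5, 6]            # hypothetical values from an imputation model
gap = ks_statistic(reference, imputed)   # 0.5
```

A statistic near 0 means the imputed values follow roughly the same distribution as the reference sample, while a value near 1 means the two barely overlap.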


17.
When applying data-mining techniques to real-world data, we often face observations that have no value recorded for some attributes. This can be caused by several phenomena, such as a machine's inability to record certain characteristics or a person refusing to answer a question in a poll. Depending on the cause, the missing values may follow one kind of pattern or another, or show no regularity at all. One approach to mitigating the effect of missing data on machine learning tasks is to replace the missing observations. Imputation algorithms attempt to calculate a value for a missing entry using information associated with it, i.e., the attribute and/or other values in the same observation. While several imputation methods have been proposed in the literature, few works have addressed the relationship between the type of missing data, the choice of the imputation method, and the effectiveness of classification algorithms that use the imputed data. In this paper we address the relationship among these three factors. By constructing a benchmark of hundreds of databases containing different types of missing data, and applying several imputation methods and classification algorithms, we empirically show that an interaction between imputation methods and supervised classification can be deduced. In addition, differences in classification performance for the same imputation method under different missing data patterns have been found. This points to the value of choosing the imputation method and the classifier algorithm jointly, according to the missing data type.

18.
The performance of classification algorithms is highly dependent on the quality of training data. Missing attribute values are quite common in many real-world applications, so a complementary method is needed to improve the quality of the data and, consequently, the performance of the classifier. To deal with this problem, two strategies are commonly employed in practice: 1) multiple imputation, which often maintains the statistical properties of the original data and usually performs well, at the expense of high computational cost; 2) single imputation, which in general provides a suitable solution for data sets with few missing attribute values but rarely achieves good results when the number of missing values is high. This paper proposes a new single imputation method which uses Attribute-based Decision Graphs (AbDG) to estimate the missing values. AbDGs are a new type of data graph which embeds the information contained in the training set into a graph structure, built over pre-defined intervals of values from different attributes. As a consequence, similar data instances induce similar subgraphs when projected onto the AbDG, resulting in distinct patterns of connections. The main contribution of the paper is a well-defined procedure for performing imputation by partially matching instances with missing values against the AbDG. The proposed imputation method can effectively deal with data sets having high rates of missing attribute values while incurring low computational cost, a significant result towards the development of robust expert and intelligent systems. The results provide evidence that the proposed method is sound and yields high-quality imputation for classification purposes.

19.
In this paper, we employ a novel two-stage soft computing approach for data imputation to assess the severity of phishing attacks. The imputation method involves the K-means algorithm and a multilayer perceptron (MLP) working in tandem. The hybrid is applied to replace the missing values of financial data, which is used for predicting the severity of phishing attacks in financial firms. After imputing the missing values, we mine the financial data related to the firms along with the structured form of the textual data using a multilayer perceptron (MLP), a probabilistic neural network (PNN), and decision trees (DT) separately. Of particular significance are the overall classification accuracies of 81.80%, 82.58%, and 82.19% obtained using MLP, PNN, and DT respectively. It is observed that the present results outperform those of prior research. The overall classification accuracies for the three risk levels of phishing attacks using the classifiers MLP, PNN, and DT are also superior.
