Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
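The k nearest neighbors that the k-nearest-neighbor imputation algorithm (kNNI) selects for a record with missing data may contain noise. To address this, a new missing-value imputation algorithm is proposed: Mutual k-Nearest Neighbor Imputation (MkNNI). A data point used to fill a missing value must not only be one of the k nearest neighbors of the incomplete record, but must also count that record among its own k nearest neighbors. This effectively guards against noisy points among the k neighbors selected by kNNI. Experimental results show that MkNNI is, overall, more accurate than kNNI.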
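The mutual-neighbor condition described above can be sketched in a few lines. This is only an illustration under assumptions that the abstract does not state: neighbors are found with Euclidean distance on fully observed feature columns `X`, the imputed value is the plain mean of the donors' values in `y`, and the code falls back to ordinary kNN donors when no mutual neighbor exists. The names `knn_set` and `mutual_knn_impute` are hypothetical.

```python
import numpy as np

def knn_set(X, i, k):
    """Return the indices of the k rows of X closest to row i (Euclidean)."""
    dist = np.linalg.norm(X - X[i], axis=1)
    dist[i] = np.inf  # a row is never its own neighbor
    return set(np.argsort(dist)[:k].tolist())

def mutual_knn_impute(X, y, target, k):
    """Estimate y[target] from donors that are mutual k-nearest neighbors.

    A donor j qualifies only if j is among the k nearest neighbors of
    `target` AND `target` is among the k nearest neighbors of j.
    Assumption: mean aggregation and a fallback to plain kNN donors when
    no mutual neighbor exists; neither is specified in the abstract.
    """
    candidates = knn_set(X, target, k)
    mutual = [j for j in sorted(candidates)
              if target in knn_set(X, j, k) and not np.isnan(y[j])]
    donors = mutual or [j for j in sorted(candidates) if not np.isnan(y[j])]
    return float(np.mean(y[donors]))
```

In this sketch a point that is close to the incomplete record only because the record is isolated, while the point itself sits inside a different dense group, is filtered out, which is the noise-rejection effect the abstract describes.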

2.
Microarray data are used in many biomedical experiments. They often contain missing values, which significantly affect statistical algorithms. Although a number of imputation algorithms have been proposed, they have various limitations in exploiting local and global information effectively for estimation. It is necessary to develop more effective techniques to solve the data imputation problem. In this paper, we propose a theoretical framework of local weighted approximation for missing value estimation, based on the Taylor series approximation. Besides revealing that k-nearest neighbor imputation (KNNimpute) is a special case of the framework, we study its linear case, local weighted linear approximation imputation (LWLAimpute), from theory to experiment. Experimental results show that LWLAimpute and its iterative version can achieve better performance than some existing imputation methods; the advantage becomes more significant as the level of missing values increases.

3.
New imputation algorithms for estimating missing values in compositional data are introduced. A first proposal uses the k-nearest neighbor procedure based on the Aitchison distance, a distance measure especially designed for compositional data. It is important to adjust the estimated missing values to the overall size of the compositional parts of the neighbors. As a second proposal, an iterative model-based imputation technique is introduced which starts from the result of the proposed k-nearest neighbor procedure. The method is based on iterative regressions, thereby accounting for the whole multivariate data information. The regressions have to be performed in a transformed space, and depending on the data quality, classical or robust regression techniques can be employed. The proposed methods are tested on real and simulated data sets. The results show that the proposed methods outperform standard imputation methods. In the presence of outliers, the model-based method with robust regressions is preferable.
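For readers unfamiliar with the Aitchison distance, the sketch below computes it as the Euclidean distance between centered log-ratio (clr) transforms, which is a standard equivalent formulation. It assumes both compositions are fully observed and strictly positive; how the cited work handles parts that are missing during the neighbor search is not stated in the abstract, and the function names are hypothetical.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between clr-transformed parts."""
    return float(np.linalg.norm(clr(x) - clr(y)))
```

Because only ratios between parts enter the clr transform, rescaling a composition (for example from proportions to percentages) leaves the distance unchanged, which is why this measure suits compositional data better than the raw Euclidean distance.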

4.
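To address the shortcomings of Euclidean-distance imputation and the excessive proportion of missing data in microarray datasets, a nearest-neighbor algorithm is proposed that imputes microarray data sequentially using the Mahalanobis distance. It can make full use of all valid information in the dataset to fill in missing data. Experiments on real gene datasets show that the improved nearest-neighbor algorithm clearly outperforms existing algorithms.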

5.
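A missing value imputation algorithm based on Mahalanobis distance   Total citations: 1, self-citations: 0, citations by others: 1
Yang Tao (杨涛), Luo Jiawei (骆嘉伟), Wang Yan (王艳), Wu Junhao (吴君浩). 《计算机应用》 (Journal of Computer Applications), 2005, 25(12): 2868-2871
A Mahalanobis-distance-based imputation algorithm is proposed for estimating missing data in gene expression datasets. The algorithm selects nearest-neighbor genes by the Mahalanobis distance between genes and reuses already obtained estimates in subsequent estimation steps. It then computes the weighting coefficients of the nearest neighbors using the concept of entropy from information theory, yielding the imputed values. Experimental results demonstrate that the algorithm is effective and that it outperforms other nearest-neighbor-based methods for handling missing values.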
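The neighbor-selection step can be illustrated as follows. This is a minimal sketch, assuming the rows of `X` are genes restricted to fully observed columns and that the covariance is estimated from those same rows; the pseudo-inverse is used here only so that the sketch tolerates a singular covariance matrix. The entropy-based weighting and the reuse of earlier estimates are not reproduced because the abstract gives no formula for them. The name `mahalanobis_neighbors` is hypothetical.

```python
import numpy as np

def mahalanobis_neighbors(X, target, k):
    """Return (indices, distances) of the k rows nearest to row `target`
    under the Mahalanobis distance estimated from X itself."""
    cov = np.cov(X, rowvar=False)          # columns are variables
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    diff = X - X[target]
    # Squared distance diff_i^T * cov_inv * diff_i for every row i.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    dist = np.sqrt(np.maximum(d2, 0.0))    # clip tiny negative round-off
    dist[target] = np.inf                  # exclude the row itself
    order = np.argsort(dist)[:k]
    return order, dist[order]
```

Unlike the Euclidean distance, this distance does not change when a column is rescaled, so a variable measured on a large numeric scale does not dominate the choice of neighbors.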

6.
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.

7.
When applying data-mining techniques to real-world data, we often find ourselves facing observations that have no value recorded for some attributes. This can be caused by several phenomena, such as a machine's inability to record certain characteristics or a person refusing to answer a question in a poll. Depending on the cause, the missing values may follow one kind of pattern or another, or show no regularity at all. One approach to mitigating the effect of missing data on machine learning tasks is to replace the missing observations. Imputation algorithms attempt to calculate a value for a missing entry, using information associated with it, i.e., the attribute and/or other values in the same observation. While several imputation methods have been proposed in the literature, few works have addressed the relationship between the type of missing data, the choice of the imputation method, and the effectiveness of classification algorithms that use the imputed data. In this paper we address the relationship among these three factors. By constructing a benchmark of hundreds of databases containing different types of missing data, and applying several imputation methods and classification algorithms, we empirically show that an interaction between imputation methods and supervised classification can be deduced. In addition, differences in classification performance for the same imputation method under different missing data patterns have been found. This points to the value of choosing the imputation method and the classifier algorithm jointly, according to the missing data type.

8.
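Missing-value imputation is a highly challenging task in machine learning and data mining. Missing values in a data source have a considerable negative impact on the performance of learning algorithms and on the quality of what is learned. Existing imputation methods do not yet meet users' needs. A missing-value imputation method based on grey system theory is proposed. It uses instance-based nonparametric fitting together with grey-theory techniques and imputes the missing data repeatedly until the imputed values converge or satisfy the user's requirements. Experimental results show that the method outperforms the existing KNN imputation method and ordinary mean substitution in both imputation quality and efficiency.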

9.
While there is an ample amount of medical information available for data mining, many of the datasets are unfortunately incomplete, missing relevant values needed by many machine learning algorithms. Several approaches have been proposed for the imputation of missing values, using various reasoning steps to provide estimations from the observed data. One of the important steps in data mining is data preprocessing, where unrepresentative data is filtered out of the data to be mined. However, none of the related studies about missing value imputation consider performing a data preprocessing step before imputation. Therefore, the aim of this study is to examine the effect of two preprocessing steps, feature and instance selection, on missing value imputation. Specifically, eight different medical-related datasets are used, containing categorical, numerical and mixed types of data. Our experimental results show that imputation after instance selection can produce better classification performance than imputation alone. In addition, we demonstrate that imputation after feature selection does not have a positive impact on the imputation result.

10.
One way to complete missing values is to use correlations between the attributes of the data. The problem is that it is difficult to identify relations within data containing missing values. Accordingly, we develop a kernel-based missing data imputation method in this paper. This approach aims at making an optimal inference on statistical parameters (mean, distribution function and quantile) after missing data are imputed. We refer to this approach as the parameter optimization method (POP algorithm). We experimentally evaluate our approach and demonstrate that our POP algorithm (random regression imputation) is much better than deterministic regression imputation in efficiency and in generating an inference on the above parameters.

11.
Context: Although independent imputation techniques are comprehensively studied in software effort prediction, there are few studies on embedded methods for dealing with missing data in software effort prediction. Objective: We propose the BREM (Bayesian Regression and Expectation Maximization) algorithm for software effort prediction and two embedded strategies to handle missing data. Method: The MDT (Missing Data Toleration) strategy ignores the missing data when using BREM for software effort prediction, and the MDI (Missing Data Imputation) strategy uses observed data to impute missing data in an iterative manner while elaborating the predictive model. Results: Experiments on the ISBSG and CSBSG datasets demonstrate that when there are no missing values in the historical dataset, BREM outperforms LR (Linear Regression), BR (Bayesian Regression), SVR (Support Vector Regression) and the M5′ regression tree in software effort prediction, on the condition that the test set is not greater than 30% of the whole historical dataset for the ISBSG dataset and 25% of the whole historical dataset for the CSBSG dataset. When there are missing values in historical datasets, BREM with the MDT and MDI strategies significantly outperforms the independent imputation techniques, including MI, BMI, CMI, MINI and M5′. Moreover, the MDI strategy provides BREM with more accurate imputation of the missing values than the independent imputation techniques, on the condition that the level of missing data in the training set is not larger than 10% for both the ISBSG and CSBSG datasets. Conclusion: The experimental results suggest that BREM is promising in software effort prediction. When there are missing values, MDI is the preferred strategy to embed with BREM.

12.
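Missing values in microarray data affect subsequent data analysis, so it is necessary to estimate them correctly. By incorporating a k-selection algorithm into sequential local least squares imputation, a parameter-free missing-value imputation method (SLLSkimpute) is proposed. The method has three characteristics: first, no parameters need to be determined in advance; second, a different number of neighbor genes is used for each target gene; third, missing values are estimated sequentially, and estimates already obtained are selectively used in subsequent estimation. Experimental results confirm the effectiveness of the algorithm; its estimation performance is better than that of several commonly used imputation methods.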

13.
Missing data in large insurance datasets affects the learning and classification accuracies in predictive modelling. Insurance datasets will continue to increase in size as more variables are added to aid in managing client risk and will therefore be even more vulnerable to missing data. This paper proposes a hybrid multi-layered artificial immune system and genetic algorithm for partial imputation of missing data in datasets with numerous variables. The multi-layered artificial immune system creates and stores antibodies that bind to and annihilate an antigen. The genetic algorithm optimises the learning process of a stimulated antibody. The evaluation of the imputation is performed using the RIPPER, k-nearest neighbour, naïve Bayes and logistic discriminant classifiers. The effect of the imputation on the classifiers is compared with that of the mean/mode and hot deck imputation methods. The results demonstrate that when missing data imputation is performed using the proposed hybrid method, the classification improves and the robustness to the amount of missing data is increased relative to the mean/mode method for data missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). The imputation performance is similar to or marginally better than that of the hot deck imputation.

14.
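A missing value estimation algorithm based on a multiple regression model in sensor networks   Total citations: 1, self-citations: 0, citations by others: 1
In wireless sensor networks, missing sensed data is unavoidable and creates great difficulties for many applications of such networks. The best way to address the problem is to estimate the missing data accurately. A missing-value estimation algorithm based on a multiple regression model is proposed. The algorithm first estimates the missing data with multiple linear regression models built separately on the temporal correlation and the spatial correlation of the sensed data. It then assigns a weight to each of the two estimates, temporal and spatial, according to the goodness of fit of the corresponding regression model, and takes their weighted average as the final estimate. Because the algorithm considers multiple neighbor nodes at once and uses their sensed data jointly to estimate a missing value, its estimation performance is reliable and stable. The algorithm was tested on two real datasets, and the experimental results show that it can effectively estimate missing data in wireless sensor networks.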
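The two-stage idea (fit one regression per dimension, then blend the two estimates by goodness of fit) might look like the sketch below. The abstract does not give the weighting formula, so weights proportional to each model's R² are an assumption, as are the helper names `fit_linear`, `predict`, and `combine_estimates` and the equal-weight fallback when both fits are uninformative.

```python
import numpy as np

def fit_linear(A, b):
    """Least-squares fit of b on the columns of A with an intercept.

    Returns (coefficients, R^2), with R^2 clipped at 0 so it can be
    used directly as a non-negative weight.
    """
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    pred = A1 @ coef
    ss_res = float(np.sum((b - pred) ** 2))
    ss_tot = float(np.sum((b - b.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return coef, max(r2, 0.0)

def predict(coef, a):
    """Evaluate a fitted model on one predictor vector a."""
    return float(coef[0] + np.dot(coef[1:], a))

def combine_estimates(est_time, r2_time, est_space, r2_space):
    """Blend the temporal and spatial estimates with R^2-proportional weights."""
    total = r2_time + r2_space
    if total == 0:
        return 0.5 * (est_time + est_space)  # assumed fallback
    return (r2_time * est_time + r2_space * est_space) / total
```

In use, the temporal model would be fitted on a node's own earlier readings and the spatial model on readings from several neighbor nodes at the same time step; the model that explains more variance then contributes more to the final estimate.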

15.
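Measuring the distance or similarity between two instances plays an important role in data mining and machine learning. Commonly used distance measures are mainly suited to numerical data. For categorical data, this paper proposes a data-driven similarity measure. The method uses the information in attribute values and class labels, combining the class-conditional probabilities of attribute values with information theory to measure the similarity of categorical data. For comparison with previously proposed similarity measures, each measure is combined with the k-nearest-neighbor algorithm to classify several categorical datasets, and the resulting error rates are compared using ten-fold cross-validation. Experiments show that combining the proposed measure with the k-nearest-neighbor method yields classification with a lower error rate.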

16.
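Ma Qian (马茜), Gu Yu (谷峪), Li Fangfang (李芳芳), Yu Ge (于戈). 《软件学报》 (Journal of Software), 2016, 27(9): 2332-2347
In recent years, with the wide application of sensing networks, sensory data has grown explosively. However, owing to factors such as the inherent limitations of hardware, the randomness of deployment environments, and human error during data processing, sensory data usually contains a large number of missing values. Most existing upper-layer analysis tools cannot handle datasets with missing values, so imputing the missing data is indispensable. Many missing-data imputation algorithms exist, but when missing data is dense their accuracy is hard to guarantee, and they do not consider the effect of imputation order on accuracy. To address this, an order-sensitive missing value imputation framework for multi-source sensory data, OMSMVI, is proposed. The framework makes full use of the multi-dimensional correlations specific to sensory data (temporal, spatial, and attribute correlation) to measure the similarity between different data sources. Based on this multi-dimensional similarity, it builds a similarity graph centered on the data sources with missing values and treats already imputed values as observations in subsequent imputation. It also takes the overall distribution of the data sources with missing values into account and imputes in an order-sensitive way: it first decides the order in which missing values are to be filled and then fills them. When missing data is dense, the complete neighbors of a data source with missing values tend to have low similarity to it, which lowers accuracy; ordered imputation effectively mitigates this problem. Finally, the KNN imputation algorithm is improved into a new neighborhood-based imputation algorithm, NI, which uses the multi-dimensional similarity of sensory data to find all neighboring nodes of a data source with missing values. This resolves the difficulty of choosing K in KNN imputation and further improves imputation accuracy. Experiments on two real datasets, with comparison against baseline imputation algorithms, verify the accuracy and effectiveness of the algorithm.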

17.
In this paper, we employ a novel two-stage soft computing approach for data imputation to assess the severity of phishing attacks. The imputation method involves the K-means algorithm and a multilayer perceptron (MLP) working in tandem. The hybrid is applied to replace the missing values of financial data, which is used for predicting the severity of phishing attacks in financial firms. After imputing the missing values, we mine the financial data related to the firms along with the structured form of the textual data using MLP, probabilistic neural network (PNN) and decision trees (DT) separately. Of particular significance are the overall classification accuracies of 81.80%, 82.58%, and 82.19% obtained using MLP, PNN, and DT respectively. It is observed that the present results outperform those of prior research. The overall classification accuracies for the three risk levels of phishing attacks using the classifiers MLP, PNN, and DT are also superior.

18.
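Current algorithms for handling incomplete data achieve low accuracy when filling in missing values. To address this, an incomplete-data imputation algorithm based on CFS clustering and an improved auto-encoder model is proposed. The CFS clustering algorithm is used to cluster the incomplete dataset, the denoising auto-encoder model is improved, and, based on the clustering result, the improved auto-encoder is used to fill in the missing data. To enable CFS clustering to operate on incomplete datasets, a partial distance strategy is proposed for measuring the distance between incomplete data objects. Experimental results show that the proposed algorithm can fill in missing data effectively.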
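A partial distance between incomplete objects can be sketched as below. The abstract does not spell out its formula, so this uses the commonly cited formulation as an assumption: sum the squared differences over the attributes observed in both objects, then rescale by the ratio of total attributes to shared attributes. Missing entries are represented as NaN, and `partial_distance` is a hypothetical name.

```python
import numpy as np

def partial_distance(x, y):
    """Distance between two vectors that may contain NaN for missing entries.

    Only attributes observed in BOTH vectors contribute; the sum is scaled
    by (total attributes / shared attributes) so that objects with few
    shared attributes are not made to look artificially close.
    Returns NaN when the two vectors share no observed attribute.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    shared = ~np.isnan(x) & ~np.isnan(y)
    n_shared = int(shared.sum())
    if n_shared == 0:
        return float("nan")
    sq_sum = np.sum((x[shared] - y[shared]) ** 2)
    return float(np.sqrt(sq_sum * x.size / n_shared))
```

For complete vectors the scaling factor is 1 and the function reduces to the ordinary Euclidean distance, so a clustering routine can call it uniformly on complete and incomplete objects.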

19.
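A missing data imputation algorithm based on EM and Bayesian networks   Total citations: 2, self-citations: 0, citations by others: 2
Datasets with missing data are common in practical applications, and handling missing data has become a research focus in the classification field. Several general missing-data imputation algorithms are analyzed and compared, and a new imputation algorithm based on EM and Bayesian networks is proposed. The algorithm uses naive Bayes to estimate the initial values for the EM algorithm, then combines EM with a Bayesian network and iterates to determine the final updater, obtaining the completed dataset at the same time. Experimental results show that, compared with classical imputation algorithms, the new algorithm achieves higher classification accuracy and substantially reduces overhead.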

20.
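DNA methylation sequencing data produced by high-throughput sequencing contains some missing values for various reasons. To address this, VAE-MethImp, an imputation model for missing DNA methylation data based on a variational autoencoder, is proposed. VAE-MethImp is a deep latent-space generative model consisting of an encoding layer, a latent layer, and a decoding layer, and it has a strong ability to reconstruct its input. The encoding layer infers the mean and variance; the latent layer is the normal distribution specific to the input data, computed from the mean and variance output by the encoding layer; the decoding layer decodes the features contained in the latent layer to generate the reconstructed data. Imputation experiments on lung cancer and breast cancer data show that the features extracted by VAE-MethImp are more informative. In imputation accuracy, VAE-MethImp improves by 4.8% over SVD, the best of the comparison methods (mean (Mean), nearest neighbor (KNN), principal component analysis (PCA), and singular value decomposition (SVD)). Survival analysis results show that the data imputed by VAE-MethImp has better predictive power, and also demonstrate a direct association between DNA methylation and cancer survival.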
