Similar literature
1.
Missing-data imputation method for a seafood safety early-warning system   Total citations: 1 (self: 0, others: 1)
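To address missing data in seafood safety early-warning systems, a missing-data imputation method is proposed. Many existing imputation methods are based on rough sets, but most do not take into account the number of missing attributes of each object. The proposed method splits an information table that contains missing data into a complete part and an incomplete part and processes them separately; when filling in missing values it considers both attribute importance and the number of missing attributes. An information table with no missing data is output directly. Experimental results show that the method can be used for missing-data imputation in seafood safety early-warning systems.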

2.
Handling mixture models when the data contain missing values is an important research issue. In this paper, computational strategies using auxiliary indicator matrices are introduced for efficiently handling mixtures of multivariate normal distributions when the data are missing at random and have an arbitrary missing-data pattern, meaning that missing data can occur anywhere. We develop a novel EM algorithm that can dramatically reduce computation time and can be exploited in many applications, such as density estimation, supervised clustering, and prediction of missing values. For multiple imputation of missing data, we also offer a data augmentation scheme using the Gibbs sampler. The proposed methodologies are illustrated on several real data sets with varying proportions of missing values.

3.
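Traditional methods for repairing missing data often suffer from poor imputation quality, long running time, and low imputation accuracy. An association-based repair method for missing data streams in spatio-temporal big data is proposed. The method first uses association rules between data streams to create conditional functional dependencies, then computes the association similarity between data streams and uses it to compute weighted values for the missing data, which completes the detection of how the missing data fuse with the corresponding critical points. Finally, a best-confidence method is chosen to determine the order in which missing data are repaired, so that the missing data streams of spatio-temporal big data are repaired. Simulations show that the proposed method detects missing data streams accurately and repairs them well: the repaired data streams are very close to the original spatio-temporal big data.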

4.
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.

5.

One relevant problem in data quality is missing data. Despite the frequency and relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. Missing data should be handled carefully, however, otherwise bias may be introduced into the induced knowledge. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation denotes a procedure that replaces the missing values in a data set with plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used, which allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method widely used to treat missing values.
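As an illustration of the k-nearest neighbor imputation idea described in this abstract, here is a minimal sketch. It is not the authors' implementation: the function name `knn_impute`, the rescaled Euclidean distance over shared observed columns, and the use of a plain mean over the donors are all assumptions made for the example.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""

    def distance(a, b):
        # Euclidean distance over columns observed in both rows, rescaled
        # so that rows sharing only a few columns are not favoured.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        squared = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(squared * len(a) / len(shared))

    filled = [list(r) for r in rows]  # work on a copy; the input is not mutated
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],
    [0.9, 2.1, 2.9],
    [8.0, 9.0, 10.0],
]
completed = knn_impute(data, k=2)
```

In this toy table the second row is closest to the first and third rows, so its missing entry is filled from their values rather than from the outlying fourth row. Because the imputation runs before any learner sees the data, the same completed table could be passed to any learning algorithm.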

6.
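Completeness is an important dimension of data usability. Because of problems in data collection and related processes, real-world data are often incomplete. When faced with incomplete data, existing clustering algorithms generally either ignore the missing values or impute them, but when the data are missing not at random these strategies severely degrade clustering accuracy. When data are missing not at random, the missingness pattern is related to the values of the missing attributes. A measure of missingness-pattern similarity is therefore added to the similarity measure for incomplete objects, and two PCM (possibilistic c-means) fuzzy clustering algorithms that incorporate missingness patterns are proposed: PatDistPCM, which minimizes the sum of missingness-pattern distances, and PatCluPCM, which is based on clustering the missingness patterns. Experiments on two public data sets show that PatDistPCM and PatCluPCM effectively improve the accuracy of clustering results on data with values missing not at random.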

7.
This paper examines properties of test statistics for random effects with incomplete panel data. Incomplete panel data can be divided into two groups: one arises from randomly missing or unbalanced data, and the other from systematically missing data. We focus on the former case. Some statistical properties of regression analysis with missing independent variables are well known. A simple way to treat missing observations is to discard the missing cases, but such an approach may be highly inefficient. In this paper, instead of discarding the missing cases, we consider the missing data to be the outcome of a random variable. The test statistic for random effects with randomly missing panel data is derived. We examine the statistical properties of the derived test statistic and compare it with the test statistic derived without randomness. We find that our test statistic is conservative in comparison with the test statistic derived without randomness.

8.
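To improve the accuracy of estimated values for missing data in wireless sensor networks (WSNs), a self-deciding interpolation algorithm is proposed. The algorithm selects among different missing-data estimation strategies according to the spatial correlation of the data set and the continuity of the missing data, and introduces the autoregressive moving average (ARMA) model into missing-data interpolation. Compared with traditional missing-value estimation algorithms, it takes into account both the characteristics of wireless sensor networks and the characteristics of the data set itself. Tests on a real data set show that the algorithm improves the accuracy of missing-value estimation.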

9.
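Missing data occur frequently during power load data collection and adversely affect the prediction accuracy of algorithms. Existing missing-data completion algorithms are suitable only when little data is missing and perform poorly when much is missing. To meet the challenge of severe data loss, this paper proposes a sparse-representation-based method for completing missing power load data. First, assuming data are missing at random, the training data with entries assumed missing are stacked vertically on top of the complete training data to form a training matrix. Second, an overcomplete dictionary is generated with the discrete cosine transform (DCT) and learned from the training matrix, with the aim of tuning it into a dictionary that gives the best sparse representation of the samples in the training matrix. Finally, in the test phase, the upper half of the learned dictionary is used to obtain the sparse representation of the test data with missing values, and this sparse representation together with the lower half of the learned dictionary is used to reconstruct the complete data. Experimental results show that the method completes missing power load values with higher accuracy than traditional interpolation, the correlation-based KNN algorithm, the spatio-temporal compressed-sensing estimation algorithm, and the time-series compressed-sensing prediction algorithm. Even at a missing rate as high as 95%, the method still completes the missing data effectively.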

10.
A survey of missing-data handling methods   Total citations: 1 (self: 0, others: 1)
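In the big-data era, data are growing explosively; data have become easier to obtain, and missing data have become more common. Missing data greatly reduce the usefulness of data, and handling them has become a hot research topic in big-data processing. The paper introduces the significance of the missing-data problem and the state of research in China and abroad, systematically analyzes the causes of missing data, and classifies missing-data problems. It surveys recent missing-data handling methods from China and abroad, summarizing their respective strengths and weaknesses, scope of application, and evaluation metrics, with emphasis on imputation methods such as regression imputation and clustering imputation. It concludes with a summary of and outlook for the field of missing-data handling.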

11.
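The analysis and imputation of incomplete data have long been a hot research topic in big-data processing. Traditional analysis methods cannot cluster incomplete data directly; most first impute the missing values and then cluster the data. These methods generally use the whole data set to impute missing data, which makes the imputed values susceptible to noise, yields imprecise imputation, and in turn results in low clustering accuracy. A clustering algorithm for incomplete data is proposed. It redefines the similarity formula for incomplete information systems, gives a similarity measure between incomplete data objects, and thereby clusters incomplete data directly. According to the clustering result, objects of the same class are assigned to the same cluster, and missing values are imputed from the attribute values of objects in the same class, which avoids the influence of noise on imputed values and improves imputation precision. Experimental results show that the proposed method can cluster incomplete data and effectively improves the imputation accuracy for missing data.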

12.
Data plays a vital role as a source of information for organizations, especially in the age of information technology. In practice, databases are often imperfect and have missing data, and results obtained from such a database may be biased or misleading. Imputing missing data in a database has therefore been regarded as one of the major steps in data mining. The present research used different data mining methods to construct imputation models for different types of missing data. When the missing data are continuous, regression models and neural networks are used to build the imputation models. For categorical missing data, the logistic regression model, neural network, C5.0, and CART are employed. The results showed that the regression model provided the best estimates of continuous missing data, while for categorical missing data the C5.0 model proved the best method.
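For the continuous case mentioned in this abstract, regression imputation can be sketched as follows. This is only an illustrative single-predictor version under stated assumptions: the function name `regression_impute` is hypothetical, the predictor `xs` is assumed fully observed, and at least two complete pairs with distinct `x` values are assumed so that the least-squares slope is defined.

```python
def regression_impute(xs, ys):
    """Fill None entries of ys using a least-squares line fitted on complete pairs."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    # Ordinary least squares for one predictor: slope = Sxy / Sxx.
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Keep observed values; predict only where y is missing.
    return [y if y is not None else intercept + slope * x
            for x, y in zip(xs, ys)]


predictor = [1.0, 2.0, 3.0, 4.0]
target = [2.0, 4.0, None, 8.0]
target_filled = regression_impute(predictor, target)
```

The model is fitted on the complete cases only, and observed values are never overwritten. Categorical targets would need a classifier instead, as the abstract notes.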

13.
马茜, 谷峪, 李芳芳, 于戈. Journal of Software (《软件学报》), 2016, 27(9): 2332-2347
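In recent years, with the wide deployment of sensing networks, sensory data have grown explosively. However, owing to the inherent limitations of hardware, the randomness of deployment environments, human error during data processing, and other factors, sensory data usually contain a large number of missing values. Most existing upper-layer analysis tools cannot handle data sets that contain missing values, so imputing the missing data is indispensable. Many missing-data imputation algorithms exist, but when missing data are dense their accuracy is hard to guarantee, and they do not consider the effect of imputation order on imputation accuracy. An order-sensitive missing value imputation framework for multi-source sensory data, OMSMVI, is therefore proposed. The framework makes full use of the multi-dimensional correlations specific to sensory data (temporal, spatial, and attribute correlation) to measure the similarity between data sources. Based on this multi-dimensional similarity, it builds a similarity graph centered on the data source with missing values and uses already-imputed values as observations in subsequent imputation. Taking the overall distribution of the data sources with missing values into account, it imputes missing values in an order-sensitive way: it first decides the imputation order and then imputes. Ordered imputation effectively mitigates the loss of accuracy that occurs, when missing data are dense, because the complete neighbors of a data source with missing values have low similarity to it. Finally, the KNN imputation algorithm is improved to give a new neighborhood-based imputation algorithm, NI, which uses the multi-dimensional similarity of sensory data to find all neighbor nodes of a data source with missing values; this resolves the difficulty of choosing K in KNN imputation and further improves imputation accuracy. Experiments on two real data sets, with comparison against baseline imputation algorithms, verify the accuracy and effectiveness of the algorithm.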
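The ordering idea in this abstract (decide the imputation order first, then reuse already-imputed values as observations) can be illustrated with a deliberately small sketch. This is not OMSMVI or NI: it works on a single one-dimensional series, measures "support" only by the two adjacent positions, and the name `ordered_fill` is hypothetical.

```python
def ordered_fill(series):
    """Fill None gaps in a list, best-supported position first.

    Returns the filled list and the order in which positions were filled.
    """
    values = list(series)
    missing = sorted(i for i, v in enumerate(values) if v is None)
    order = []

    def neighbours(i):
        # Adjacent values that are observed or were filled in an earlier step.
        return [values[j] for j in (i - 1, i + 1)
                if 0 <= j < len(values) and values[j] is not None]

    while missing:
        # Decide the order first: take the position with the most support.
        best = max(missing, key=lambda i: len(neighbours(i)))
        support = neighbours(best)
        if not support:
            break  # no usable neighbour anywhere; leave the rest missing
        values[best] = sum(support) / len(support)
        missing.remove(best)
        order.append(best)
    return values, order


readings = [1.0, None, 3.0, None, None, 6.0]
filled_readings, fill_order = ordered_fill(readings)
```

In this example position 1 has two observed neighbours and is filled first; positions 3 and 4 follow, and position 4 benefits from the value just imputed at position 3, which is the effect the ordering is meant to exploit when gaps are dense.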

14.
Databases used for data mining often have missing values. Missing data are often mishandled in data mining, and valuable knowledge related to the missing data is often overlooked. This study discusses patterns of missing data in survey databases. It proposes a rough set rule induction framework that enables the data miner to obtain association rules for patterns of missing data in a survey database. Through an experiment on a real-world data set, we demonstrate the approach to discovering knowledge about missing data.

15.
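To keep intelligent street-lamp control systems from being affected by missing data, a simulation method for association-based repair of missing data streams in such systems is studied. A nested-window stream-processing model is used to detect missing data streams in the control system. The model divides the sliding window into multiple nested sliding windows; as the window slides, the Pearson correlation coefficient is used to determine the correlation between adjacent data in the system, a correlation graph is reconstructed from the resulting correlations, and the graph is used to detect missing data streams. The GMM algorithm is used to partition the data set of the control system that contains missing data, and the EM imputation algorithm is used to repair the data streams after this initial partitioning. Simulation results show that the method effectively repairs missing data streams in intelligent street-lamp control systems and achieves high repair accuracy at different missing rates.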
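The windowed Pearson correlation step mentioned in this abstract can be sketched as below. The sketch uses one plain sliding window rather than the paper's nested windows, and the function names `pearson` and `windowed_correlation` as well as the choice to return 0.0 for a constant window are assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant window as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def windowed_correlation(a, b, width):
    """Correlation of two streams over every full window of the given width."""
    return [pearson(a[s:s + width], b[s:s + width])
            for s in range(len(a) - width + 1)]


stream_a = [1, 2, 3, 4, 5]
stream_b = [2, 4, 6, 8, 7]
correlations = windowed_correlation(stream_a, stream_b, width=3)
```

Here the two streams move together in the first two windows and diverge in the last one, so the correlation drops there; a drop of this kind is the sort of signal a correlation graph could use to flag a suspect segment.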

16.
Research on methods for handling missing values in industrial process data   Total citations: 1 (self: 0, others: 1)
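To address missing process data in industrial production, the paper proposes, for the first time, applying multiple imputation to missing data from industrial processes. Common missing-data handling methods are described and their strengths and weaknesses pointed out. On this basis, a regression model is built, and for two cases of multivariate industrial data, with few and with many missing values, the data are processed by three methods: deleting cases that contain missing values, simple imputation, and multiple imputation (MI). The resulting data sets are used for data mining to predict the value of the target variable, and the predictions are analyzed and compared. Experimental results show that multiple imputation performs best, providing a useful strategy for handling missing values in industrial data.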

17.
Data mining with incomplete survey data is an immature subject area. When mining a database with incomplete data, the patterns of missing data and their potential implications constitute valuable knowledge. This paper presents the conceptual foundations of data mining with incomplete data through classification that is relevant to a specific decision-making problem. The proposed technique generally supposes that incomplete data and complete data may come from different sub-populations. Its major objective is to detect interesting patterns of data-missing behavior that are relevant to a specific decision, rather than to estimate individual missing values. Using this technique, a set of complete data is used to obtain a near-optimal classifier. This classifier provides the prediction reference information for analyzing the incomplete data, and the data-missing behavior concealed in the missing data is then revealed. Using a real-world survey data set, the paper demonstrates the usefulness of this technique.

18.
In this paper we investigate applying SOM (Self-Organizing Maps) to classification and rule extraction in data sets with missing values, in particular real clinical data of bladder cancer patients provided by Kitasato University Hospital. When input data with missing values are used for SOM, the missing value is either interpolated in the preprocessing stage or replaced with a specific value or property that marks it as missing. In either case, it may be possible to extract some rules from data with missing values. On the other hand, such data can have a negative influence on classification for data sets in which missing values should be ignored. In this research we propose a method in which SOM is trained using an input vector from which the properties with missing values are excluded. The proposed method reduces the influence of the missing values. Through computer simulation, we showed that the proposed method gave good results in classification and rule extraction from clinical data of bladder cancer patients. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.
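One way to picture "excluding the properties with missing values" during SOM training is to skip missing components both when searching for the best-matching unit and when updating weights. The sketch below shows only those two steps under that assumption; it omits the neighbourhood function and learning-rate schedule of a full SOM, and the names `best_matching_unit` and `update_unit` are hypothetical, not taken from the paper.

```python
def best_matching_unit(sample, codebook):
    """Index of the codebook vector closest to sample, ignoring None components."""

    def masked_sq_dist(weights):
        # Squared distance over the observed components only.
        return sum((x - w) ** 2 for x, w in zip(sample, weights) if x is not None)

    return min(range(len(codebook)), key=lambda i: masked_sq_dist(codebook[i]))


def update_unit(sample, weights, rate):
    """Move only the observed components of one unit toward the sample."""
    return [w if x is None else w + rate * (x - w)
            for x, w in zip(sample, weights)]


codebook = [[0.0, 0.0, 0.0], [1.0, 9.0, 1.0]]
patient = [0.9, None, 1.2]  # second attribute is missing
winner = best_matching_unit(patient, codebook)
codebook[winner] = update_unit(patient, codebook[winner], rate=0.5)
```

The second unit wins even though its middle weight (9.0) is far from any plausible value, because the missing attribute contributes nothing to the distance; that weight is also left untouched by the update.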

19.
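Missing data degrade data quality and can lead to inaccurate analysis results and less reliable models; imputing missing values reduces bias and facilitates subsequent analysis. Most missing-value imputation algorithms assume that multiple missing values are weakly correlated or uncorrelated, and few consider the correlation among missing values or the imputation order. In the sales domain, imputing missing values independently reduces the use of information carried by the missing values and so substantially harms imputation accuracy. To address these problems, this paper takes the sales domain as its target. Based on the multi-dimensional features of sales behavior, and using the spatial distribution characteristics of the output values of different models, it explores an imputation-update mechanism for multiple missing values and studies an incremental imputation method for multiple missing values in sales data: features with missing values are ordered by feature correlation, and already-imputed data are fused in as information for incrementally imputing the later missing values. The algorithm considers both model generalization and the information correlation among missing data and, combined with multi-model fusion, imputes multiple missing values effectively. Extensive comparative experiments on a real chain-pharmacy sales data set verify the effectiveness of the proposed algorithm.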

20.
In this paper, we propose new imputation methods for missing genotype data of single nucleotide polymorphisms (SNPs). The common objective of imputation methods is to minimize the loss of information caused by experimentally missing elements. Imputation of missing genotype data has generally used the major allele method, but this approach falls short of that objective. It generally produces high error rates in missing value estimation, because it does not take the structure of the given genotype data into account. In our methods, we use linkage disequilibrium and haplotype information to impute the missing SNP genotypes. We report a comparative evaluation of our methods and the major allele imputation method at various randomized missing rates.
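The baseline that this abstract compares against can be sketched in a few lines. The version below assumes genotypes are stored as strings per individual (rows) and SNP (columns) and fills each gap with the most frequent observed genotype at that SNP; the function name `major_allele_impute` and this genotype-level reading of the "major allele method" are assumptions for illustration.

```python
from collections import Counter


def major_allele_impute(genotypes):
    """Replace None at each SNP (column) with that column's most frequent genotype."""
    fill = []
    for column in zip(*genotypes):
        counts = Counter(g for g in column if g is not None)
        # If a SNP has no observed genotype at all, it stays missing.
        fill.append(counts.most_common(1)[0][0] if counts else None)
    return [[fill[j] if g is None else g for j, g in enumerate(row)]
            for row in genotypes]


samples = [
    ["AA", "AG"],
    ["AA", None],
    [None, "GG"],
    ["AG", "GG"],
]
imputed = major_allele_impute(samples)
```

Each column is treated independently, so nothing about neighbouring SNPs is used. That independence is the limitation the abstract points to when it motivates linkage disequilibrium and haplotype information.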
