Similar literature
Found 19 similar documents; search took 502 ms
1.
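Wang Junlu (王俊陆), Wang Ling (王玲), Wang Yan (王妍), Song Baoyan (宋宝燕). Computer Science (《计算机科学》), 2017, 44(2): 98-102, 106
With the development of the Internet and information technology, missing and corrupted data have become increasingly common. As data collection shifts from manual to automated methods, unstable storage media and omissions during network transmission make the problem worse. Large numbers of missing values in a database not only seriously degrade the quality of user queries but also compromise the correctness of data mining and data analysis results, which in turn misleads decision making. At present there is no general-purpose method for imputing missing data; most strategies address only one type of missing value. To handle the complex case in which different types of missingness occur together in incomplete data, this paper proposes an incomplete-data imputation method based on tuple similarity (IATS). Data mining is used to extract weighted association rules from the incomplete data set, and regular missing values are imputed according to these rules. For abnormal missing values, a recommendation algorithm is introduced: a recommendation-and-filtering strategy computes tuple similarity and performs the corresponding imputation, which greatly improves the effective utilization of data and the quality of user query results. Experiments show that IATS achieves better accuracy while maintaining the imputation rate.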

2.
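Traditional methods for repairing missing data typically suffer from poor imputation quality, long running time, and low imputation accuracy. This paper proposes an association-based repair method for missing data streams in spatio-temporal big data. The method first uses association rules among data streams to create conditional functional dependencies. It then computes the association similarity among data streams and uses that similarity to compute weighted values for the missing data, thereby detecting how the missing data fuse with the corresponding critical points. Finally, a best-confidence method determines the order in which missing data are repaired, completing the repair of the missing data streams. Simulations show that the proposed method detects missing data streams accurately and repairs them well; the repaired data streams are very close to the original spatio-temporal big data.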

3.
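This paper proposes a missing-value imputation method based on support vector machines. The method splits imputation into two cases: continuous attributes and categorical attributes. For continuous attributes, support vector regression predicts the missing values; for categorical attributes, support vector classification predicts them. Comparative experiments on several UCI data sets and the MINIT handwritten Arabic numeral data set show that the algorithm achieves a higher recovery rate than traditional mean imputation and imputation based on decision-tree regression.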

4.
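Missing data degrade data quality and can make analysis results inaccurate and models less reliable; imputing missing values reduces bias and facilitates subsequent analysis. Most imputation algorithms assume that multiple missing values are weakly correlated or uncorrelated, and few consider the correlation among missing values or the order of imputation. In the sales domain, imputing missing values independently makes less use of the information they carry and substantially harms imputation accuracy. To address these problems, this paper takes the sales domain as its research target. Based on the multi-dimensional characteristics of sales behavior, and using the spatial distribution of the outputs of different models, it explores an imputation-update mechanism for multiple missing values and studies an incremental imputation method for multiple missing values in sales data. Missing features are ordered according to feature correlation, and already-imputed data are fused in as information for incrementally imputing the later missing values. The algorithm accounts for both model generalization and the informational correlation among missing data, and combines multi-model fusion to impute multiple missing values effectively. Extensive comparative experiments on a real chain-pharmacy sales data set verify the effectiveness of the proposed algorithm.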

5.
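Ma Qian (马茜), Gu Yu (谷峪), Li Fangfang (李芳芳), Yu Ge (于戈). Journal of Software (《软件学报》), 2016, 27(9): 2332-2347
In recent years, with the wide application of sensing networks, sensory data have grown explosively. However, owing to inherent hardware limitations, the randomness of deployment environments, and human error during data processing, sensory data usually contain many missing values. Most existing upper-layer analysis tools cannot handle data sets with missing values, so imputing missing data is indispensable. Many imputation algorithms exist, but when missing data are dense their accuracy is hard to guarantee, and they ignore the effect of imputation order on imputation precision. This paper therefore proposes OMSMVI (order-sensitive missing value imputation framework for multi-source sensory data). The framework exploits the multi-dimensional correlations specific to sensory data (temporal, spatial, and attribute correlation) to measure the similarity between data sources. It then builds a similarity graph centered on the missing data source from this multi-dimensional similarity, and uses already-imputed values as observations in subsequent imputation. Taking the overall distribution of missing data sources into account, it imputes missing values in an order-sensitive manner: the imputation order is decided first, and the values are then imputed. Ordered imputation effectively mitigates the loss of precision that occurs, when missing data are dense, because the complete neighbors of a missing data source have low similarity to it. Finally, the KNN imputation algorithm is improved into a new neighborhood-based imputation algorithm, NI (neighborhood-based imputation), which uses the multi-dimensional similarity of sensory data to find all neighbor nodes of a missing data source. This removes the difficulty of choosing K in KNN imputation and further improves accuracy. Experiments on two real data sets, compared against baseline imputation algorithms, verify the accuracy and effectiveness of the algorithm.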

6.
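Extracting decision rules from incomplete information systems is an important research problem in data mining. This paper analyzes the main methods for obtaining decision rules in incomplete information systems and, taking incomplete decision tables whose decision attribute has missing values as the object of study, proposes a decision-tree rule extraction algorithm based on imputing data first. The ROUSTIDA algorithm is computationally expensive during imputation and tends to produce conflicting decision rules. To address this, the proposed algorithm adopts the idea of imputing the decision attribute first and introduces the concept of object completeness to improve ROUSTIDA. The improved ROUSTIDA algorithm performs a one-pass imputation of the incomplete decision table as preprocessing, and a decision tree is then built under the limited tolerance relation using attribute significance as the heuristic function, yielding the decision rules. A worked example shows that the method is effective and that the generated rules are simple and highly accurate.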

7.
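For multi-attribute decision problems with incomplete information in which attribute values are missing, this paper proposes a machine-learning-based fuzzy imputation method for missing attribute values. The aims are to make imputation more objective, to keep the overall distribution of the data set after imputation as close as possible to that before imputation, and to lose no existing information. The method finds records that are similar on the attributes that do not need imputation, identifies from those records the possible values of the attribute to be imputed and their probabilities, and assigns values to the missing entries according to those probabilities. The basic idea applies to both discrete and continuous data sets.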
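
The value-assignment step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `value_distribution` and `impute_by_distribution`, the use of exact matching as the notion of "similar", the toy records, and the use of `None` for a missing value are all hypothetical choices made here.

```python
import random
from collections import Counter

def value_distribution(records, target, attr):
    """Empirical distribution of `attr` among records that match `target`
    on every other attribute that `target` has observed.
    Exact matching is an assumed, simplistic notion of similarity."""
    known = [k for k, v in target.items() if k != attr and v is not None]
    similar = [
        r[attr]
        for r in records
        if r.get(attr) is not None and all(r.get(k) == target[k] for k in known)
    ]
    counts = Counter(similar)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def impute_by_distribution(records, target, attr, rng=random):
    """Draw a value for the missing attribute in proportion to its probability."""
    dist = value_distribution(records, target, attr)
    if not dist:
        return None  # no similar record found: leave the gap unfilled
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]

# Hypothetical toy data: three records resemble the target, one does not.
records = [
    {"size": "L", "color": "red", "grade": "A"},
    {"size": "L", "color": "red", "grade": "A"},
    {"size": "L", "color": "red", "grade": "B"},
    {"size": "S", "color": "blue", "grade": "C"},
]
target = {"size": "L", "color": "red", "grade": None}
dist = value_distribution(records, target, "grade")
filled = impute_by_distribution(records, target, "grade", random.Random(0))
```

Sampling in proportion to the observed probabilities, rather than always taking the most frequent value, is what lets the imputed data keep roughly the same distribution as the observed data.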

8.
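Most classifiers simply discard missing data. To address this, a classification algorithm that applies a weighted conservative inference rule to instances with missing values is proposed, based on naive credal classification. Using the correlation coefficients between the feature attributes and the decision attribute as weights, the algorithm applies the weighted conservative inference rule to select among all possible classes of an instance to be classified whose missing attributes are not missing at random. Experimental results show that the proposed algorithm effectively improves classification performance and is an effective classifier for data sets with missing data.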

9.
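To improve the usability of customer-service terminal data, reduce interference from redundant data, uncover potential customers, and inform sales strategy, this paper studies a method for iteratively eliminating redundant customer-service terminal data based on a decision-tree algorithm. A data-warehouse approach extracts and integrates the terminal data. Text data are preprocessed by stop-word removal and Chinese word segmentation; numeric data are preprocessed by missing-value imputation and removal of outlying values. An ID3 decision tree is built to classify the terminal data, the inter-class similarity of data of the same class is computed, a rule for identifying redundant data is constructed, redundant data are detected, and an eliminator then removes them. Experimental results show that the method eliminates redundant terminal data and that its space reduction ratio is closer to the redundancy rate.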

10.
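To better handle the uncertainty or fuzziness of remote sensing images and improve classification accuracy, a fuzzy-rule classification method for land-use remote sensing images based on fuzzy subsets is proposed. Fuzzy membership function values are mapped to specific fuzzy subsets to form the conditions of fuzzy rules, and a classification rule base is built from samples. Land use is then classified by computing the fuzzy closeness degree between the rule-condition part of the data to be classified and the rule-condition parts in the rule base. Results show that, compared with the traditional maximum likelihood classifier, the fuzzy-rule method significantly improves accuracy on highly fuzzy data and achieves results close to those of maximum likelihood on data with low fuzziness.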

11.
The knowledge discovery process is supported by information gathered from collected data sets, which often contain errors in the form of missing values. Data imputation is the activity aimed at estimating values for missing data items. This study focuses on the development of automated data imputation models, based on artificial neural networks, for monotone patterns of missing values. The present work proposes a single imputation approach relying on a multilayer perceptron whose training is conducted with different learning rules, and a multiple imputation approach based on the combination of a multilayer perceptron and k-nearest neighbours. Eighteen real and simulated databases were exposed to a perturbation experiment with random generation of a monotone missing data pattern. An empirical test was carried out on these data sets, including both approaches (single and multiple imputation), and three classical single imputation procedures (mean/mode imputation, regression and hot-deck) were also considered. The experiments therefore involved five imputation methods. The results, considering different performance measures, demonstrated that, in comparison with traditional tools, both proposals improve the automation level and data quality while offering satisfactory performance.
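
As a point of reference for the classical baselines named in this abstract, mean/mode imputation can be sketched as below. This is a minimal sketch, assuming a column stored as a Python list with `None` marking a missing entry; the function name `mean_mode_impute` and the sample columns are hypothetical, and the neural-network approaches the abstract proposes are not shown.

```python
from collections import Counter
from statistics import mean

def mean_mode_impute(column):
    """Replace None entries with the mean of the observed entries for a
    numeric column, or with their mode for any other column."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing observed, so nothing to estimate from
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

ages = mean_mode_impute([20, None, 40])
colors = mean_mode_impute(["red", "blue", None, "red"])
```

Every gap in a column receives the same value, which is why this baseline is cheap but tends to shrink the variance of the imputed attribute.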

12.
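Missing data in wireless sensor networks adversely affect subsequent data analysis, so preprocessing before analysis is essential. Sensor network data follow certain patterns of variation in both time and space, yet existing imputation algorithms tend to approach the problem from a single perspective. To make full use of both the temporal and spatial dimensions, this paper proposes a missing-value imputation method based on spatio-temporal correlation. The method uses regression fitting, an improved BP neural network, and other techniques to impute missing data. Experimental results show that the method effectively improves imputation accuracy.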

13.
Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill-in) the missing values. This paper evaluates how the choice of different imputation methods affects the performance of classifiers that are subsequently used with the imputed data. The experiments here focus on discrete data. This paper studies the effect of missing data imputation using five single imputation methods (a mean method, a Hot deck method, a Naïve-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with high amount (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, improve classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naïve-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation.

14.
Multibiometric systems, which consolidate or fuse multiple sources of biometric information, typically provide better recognition performance than unimodal systems. While fusion can be accomplished at various levels in a multibiometric system, score-level fusion is commonly used as it offers a good trade-off between data availability and ease of fusion. Most score-level fusion rules assume that the scores pertaining to all the matchers are available prior to fusion. Thus, they are not well equipped to deal with the problem of missing match scores. While there are several techniques for handling missing data in general, the imputation scheme, which replaces missing values with predicted values, is preferred since this scheme can be followed by a standard fusion scheme designed for complete data. In this work, the performance of the following imputation methods is compared in the context of multibiometric fusion: K-nearest neighbor (KNN) schemes, likelihood-based schemes, Bayesian-based schemes and multiple imputation (MI) schemes. Experiments on the MSU database assess the robustness of the schemes in handling missing scores at different missing rates. It is observed that the Gaussian mixture model (GMM)-based KNN imputation scheme results in the best recognition accuracy.

15.
Business intelligence and bioinformatics applications increasingly require the mining of datasets consisting of millions of data points, or crafting real-time enterprise-level decision support systems for large corporations and drug companies. In all cases, there needs to be an underlying data mining system, and this mining system must be highly scalable. To this end, we describe a new rule learner called DataSqueezer. The learner belongs to the family of inductive supervised rule extraction algorithms. DataSqueezer is a simple, greedy, rule builder that generates a set of production rules from labeled input data. In spite of its relative simplicity, DataSqueezer is a very effective learner. The rules generated by the algorithm are compact, comprehensible, and have accuracy comparable to rules generated by other state-of-the-art rule extraction algorithms. The main advantages of DataSqueezer are very high efficiency, and missing data resistance. DataSqueezer exhibits log-linear asymptotic complexity with the number of training examples, and it is faster than other state-of-the-art rule learners. The learner is also robust to large quantities of missing data, as verified by extensive experimental comparison with the other learners. DataSqueezer is thus well suited to modern data mining and business intelligence tasks, which commonly involve huge datasets with a large fraction of missing data.

16.
One way to complete missing values is to exploit correlations between the attributes of the data. The problem is that it is difficult to identify relations within data containing missing values. Accordingly, we develop a kernel-based missing data imputation approach in this paper. The approach aims at making an optimal inference on statistical parameters (mean, distribution function and quantile) after missing data are imputed. We refer to this approach as the parameter optimization method (POP algorithm). We experimentally evaluate our approach and demonstrate that our POP algorithm (random regression imputation) is much better than deterministic regression imputation in efficiency and in generating inferences on the above parameters.

17.
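To address the insufficient accuracy of missing-data imputation in incomplete information systems, and taking an aquaculture early-warning information system as the setting, this paper proposes a missing-data imputation algorithm based on attribute correlation. While preserving the certainty of the early-warning information system, it studies limited tolerance relations and decision rules, derives the limited tolerance class of each object with missing values from a newly defined limited tolerance relation, and introduces the correlation between condition attributes to construct a new extended matrix for imputation, thereby making the system complete. Using the imputation of missing perch-farming data as a case study, validation on data sets shows that the algorithm clearly improves imputation accuracy and running time compared with other methods.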

18.
In this paper, we propose new imputation methods for missing genotype data of single nucleotide polymorphisms (SNPs). The common objective of imputation methods is to minimize the loss of information caused by experimentally missing elements. In general, imputation of missing genotype data has used a major allele method, but this approach falls short of that objective. It generally produces high error rates in missing value estimation, since the characteristics and structure of the given genotype data are not considered. In our methods, we use linkage disequilibrium and haplotype information to impute the missing SNP genotypes. We report a comparative evaluation of our methods and the major allele imputation method under various randomized missing rates.
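
The major allele baseline that this abstract compares against can be sketched as below. This is a minimal sketch under assumptions of our own: genotypes at one SNP are written as two-letter strings such as `"AG"`, `None` marks a missing genotype, and a gap is filled with the homozygous genotype of the most frequent allele. The name `major_allele_impute` and the sample data are hypothetical, and the linkage-disequilibrium and haplotype methods proposed in the paper are not shown.

```python
from collections import Counter

def major_allele_impute(genotypes):
    """Fill missing genotypes (None) at one SNP with the homozygous
    major-allele genotype, e.g. 'AA' when 'A' is the most frequent allele."""
    alleles = Counter(a for g in genotypes if g is not None for a in g)
    if not alleles:
        return list(genotypes)  # no observed genotype to learn from
    major = alleles.most_common(1)[0][0]
    return [major * 2 if g is None else g for g in genotypes]

# Hypothetical SNP column: allele A appears 5 times, allele G 3 times.
snp = ["AA", "AG", None, "GG", "AA", None]
imputed = major_allele_impute(snp)
```

Because each SNP is filled from its own allele counts alone, the baseline ignores correlations between neighbouring SNPs, which is the information the abstract's methods try to use.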

19.

One relevant problem in data quality is missing data. Despite the frequent occurrence and the relevance of the missing data problem, many machine learning algorithms handle missing data in a rather naive way. However, missing data should be handled carefully, otherwise bias might be introduced into the knowledge induced. In this work, we analyze the use of the k-nearest neighbor algorithm as an imputation method. Imputation is a term that denotes a procedure that replaces the missing values in a data set with some plausible values. One advantage of this approach is that the missing data treatment is independent of the learning algorithm used. This allows the user to select the most suitable imputation method for each situation. Our analysis indicates that missing data imputation based on the k-nearest neighbor algorithm can outperform the internal methods used by C4.5 and CN2 to treat missing data, and can also outperform mean or mode imputation, a method widely used to treat missing values.
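
The idea of k-nearest neighbor imputation discussed in this abstract can be sketched as below. This is a minimal sketch, assuming numeric attributes only, `None` for a missing value, Euclidean distance over the attributes the incomplete row has observed, and neighbours drawn only from complete rows. The name `knn_impute` and the sample data are hypothetical and do not reproduce the experimental setup of the paper.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that attribute over the k nearest
    complete rows, measured on the attributes the incomplete row has observed."""
    complete = [r for r in rows if None not in r]
    result = []
    for row in rows:
        if None not in row or not complete:
            result.append(list(row))  # already complete, or no donors available
            continue
        observed = [i for i, v in enumerate(row) if v is not None]

        def distance(candidate):
            return math.sqrt(sum((candidate[i] - row[i]) ** 2 for i in observed))

        nearest = sorted(complete, key=distance)[:k]
        result.append([
            v if v is not None else sum(n[i] for n in nearest) / len(nearest)
            for i, v in enumerate(row)
        ])
    return result

# Hypothetical data: the last row is closest to the first two rows.
data = [[1.0, 2.0], [1.2, 2.2], [8.0, 9.0], [1.1, None]]
completed = knn_impute(data, k=2)
```

Because the imputation runs before any learner sees the data, the same completed table can be passed to any classifier, which is the independence from the learning algorithm that the abstract points out.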
