Similar Literature
20 similar documents found (search time: 109 ms)
1.
高菲  宋韶旭  王建民 《软件学报》2021,32(3):689-711
As a foundation of data management and analysis, data quality has become a growing research focus, driven by the push to apply big data and artificial intelligence more widely. Physical faults or technical defects in data-collection and recording devices commonly introduce errors into collected data, and such anomalies have a non-negligible impact on downstream analysis and AI pipelines, so data must be cleaned and repaired before use. Existing smoothing-based repair methods over-repair many originally correct points into anomalies, while constraint-based approaches such as sequential dependencies and the SCREEN method rely on constraints too weak to repair complex data accurately. Following the minimum-change principle, this paper proposes a repair method for time series data under multi-interval speed constraints, using dynamic programming to find the optimal repair path. Specifically, multiple speed intervals constrain the series; from these constraints each data point derives a set of candidate repair values, and dynamic programming selects the optimal repair among them. To validate feasibility and effectiveness, the method is evaluated on one synthetic dataset, two real datasets, and one dataset with real errors, under varying anomaly rates and data sizes. The results show that, compared with existing repair methods, the proposed method performs well in both repair quality and time cost. Further experiments using clustering and classification accuracy on several datasets confirm that data quality is critical to downstream analysis and AI, and that the method improves the quality of their results.
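The repair scheme this abstract describes can be sketched for the simplest case of a single speed interval: each point gets a small set of candidate repair values, and dynamic programming picks the minimum-change sequence whose point-to-point speeds stay inside `[smin, smax]`. The candidate offsets `deltas` and all numbers below are illustrative assumptions, not the paper's actual settings.

```python
def repair(times, values, smin, smax, deltas=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    """Minimum-change repair of a time series under one speed interval
    [smin, smax], via dynamic programming over per-point candidates."""
    n = len(values)
    cands = [[v + d for d in deltas] for v in values]
    INF = float("inf")
    cost = [[INF] * len(deltas) for _ in range(n)]   # best cost ending at candidate j
    back = [[-1] * len(deltas) for _ in range(n)]    # backpointer to previous candidate
    for j, c in enumerate(cands[0]):
        cost[0][j] = abs(c - values[0])
    for i in range(1, n):
        dt = times[i] - times[i - 1]
        for j, cj in enumerate(cands[i]):
            for k, ck in enumerate(cands[i - 1]):
                # only transitions whose speed lies in the allowed interval
                if cost[i - 1][k] < INF and smin <= (cj - ck) / dt <= smax:
                    c = cost[i - 1][k] + abs(cj - values[i])
                    if c < cost[i][j]:
                        cost[i][j] = c
                        back[i][j] = k
    # trace back the minimum-cost feasible path
    path = [0] * n
    path[-1] = min(range(len(deltas)), key=lambda j: cost[n - 1][j])
    for i in range(n - 1, 0, -1):
        path[i - 1] = back[i][path[i]]
    return [cands[i][path[i]] for i in range(n)]
```

On a series with one spike, the spike and its neighbors are nudged just enough for every speed to become feasible, e.g. `repair([0, 1, 2, 3], [0.0, 5.0, 2.0, 3.0], -1.0, 1.0)` yields a sequence whose consecutive differences all lie in `[-1, 1]`.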

2.
As information technology advances, data volumes keep growing. Clustering is a classic data analysis method, and clustering of large-scale data has drawn particular attention in recent years. Existing sequential clustering algorithms incur high memory and computation costs on large-scale data. To address this, an artificial bee colony clustering algorithm based on MapReduce is proposed: by adopting the MapReduce parallel programming paradigm, the fitness of cluster centers is computed quickly, enabling efficient clustering of large-scale data. The algorithm's clustering quality, scalability, and efficiency are validated on two kinds of data: simulated data and real disk-drive manufacturing data. Experimental results show that, compared with the existing PK-Means and parallel K-PSO algorithms, the proposed algorithm achieves better clustering quality, stronger scalability, and higher efficiency.
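The parallelizable core here — evaluating how well a set of candidate cluster centers fits the data — maps naturally onto a map/reduce pair. The sketch below is a hypothetical single-process illustration of that decomposition (the `mapper`/`reducer` names and the tiny dataset are assumptions; the actual algorithm runs on a MapReduce cluster and wraps this inside a bee colony search).

```python
from collections import defaultdict

def mapper(point, centers):
    """Map step: assign a point to its nearest center and emit
    (center index, squared distance) as the partial fitness."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    idx = dists.index(min(dists))
    return idx, min(dists)

def reducer(pairs):
    """Reduce step: sum squared distances per center; the grand total
    is the fitness of this set of centers (lower is better)."""
    acc = defaultdict(float)
    for idx, dist in pairs:
        acc[idx] += dist
    return dict(acc), sum(acc.values())

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
per_center, fitness = reducer(mapper(p, centers) for p in points)
```

Because each `mapper` call touches only one point, the map phase shards freely across workers, which is exactly why the fitness evaluation dominates the speedup.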

3.
During business process execution, information system failures or manual recording errors can cause data loss in event logs; such incomplete logs seriously degrade the quality of business process analysis. Most existing work on repairing incomplete logs takes either the data view or the behavior view alone, and rarely fuses the two. This paper proposes a multi-view event log repair method based on the BERT model. It trains a two-layer BERT model from a fused data-and-behavior view; through the pre-training tasks masked attribute model (MAM) and masked event model (MEM) and the attention mechanism of Transformer encoder blocks, it captures the bidirectional semantics and long-range dependencies of input attributes, then fine-tunes the model to repair missing values in the event log by prediction. Evaluation on publicly available datasets shows that the method performs well at repairing event logs.

4.
The paper proposes a new variable selection method for prediction settings and soft sensor applications. The method is based on the multi-layer perceptron (MLP) neural network model, where the network is trained a single time, keeping computational cost low. The proposed method was successfully applied, and compared with four state-of-the-art methods, on one artificial dataset and three real-world datasets: two publicly available datasets (Box–Jenkins gas furnace and gas mileage) and a dataset for estimating the fluoride concentration in the effluent of a real urban water treatment plant (WTP). The proposed method achieves similar or better approximation performance than the other four methods. In the experiments, among all five methods, the proposed method selects the fewest variables and variable-delay pairs to reach the best solution. In soft sensor applications, using fewer variables decreases implementation costs and may even make the soft sensor feasible at all.

5.
Semi-supervised clustering based on affinity propagation    Cited by: 31 (self-citations: 2, others: 29)
肖宇  于剑 《软件学报》2008,19(11):2803-2813
A semi-supervised clustering method based on the affinity propagation (AP) algorithm is proposed. AP clusters on top of a similarity matrix over the data points; for very large datasets it is a fast and effective clustering method beyond the reach of traditional algorithms such as K-centers clustering. However, on datasets with complex cluster structure, AP often fails to produce good results. The proposed method adjusts the similarity matrix using known labeled data or pairwise constraints, thereby improving AP's clustering performance. Experimental results show that the method not only improves AP's results on complex data, but also outperforms comparable algorithms when many constraint pairs are available.
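The abstract's key move — injecting supervision by editing the similarity matrix before AP runs — can be sketched directly. This is a minimal illustration, not the paper's exact adjustment rule: the constants `high` and `low` are hypothetical stand-ins for "maximally similar" and "effectively forbidden" (AP similarities are typically negative squared distances, so 0 is the largest value).

```python
def adjust_similarity(S, must_link=(), cannot_link=(), high=0.0, low=-1e6):
    """Return a copy of similarity matrix S with must-link pairs raised to
    `high` and cannot-link pairs suppressed to `low`, ready for AP."""
    S = [row[:] for row in S]          # do not mutate the caller's matrix
    for i, j in must_link:
        S[i][j] = S[j][i] = high
    for i, j in cannot_link:
        S[i][j] = S[j][i] = low
    return S

# similarities as negative squared distances between 3 points
S = [[0.0, -4.0, -9.0],
     [-4.0, 0.0, -1.0],
     [-9.0, -1.0, 0.0]]
S2 = adjust_similarity(S, must_link=[(0, 2)], cannot_link=[(1, 2)])
```

The adjusted matrix then feeds an off-the-shelf AP implementation unchanged, which is what makes this form of semi-supervision attractive.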

6.
“Dimensionality” is one of the major problems affecting the quality of learning in most machine learning and data mining tasks. Training a classification model on high-dimensional datasets can lead to overfitting of the learned model to the training data. Overfitting reduces the generalization of the model and therefore causes poor classification accuracy on new test instances. Another drawback of high dimensionality is the high CPU time required to train and test the model. Applying feature selection to the dataset before learning is essential to improve the performance of the classification task. In this study, a new hybrid method that combines artificial bee colony optimization with the differential evolution algorithm is proposed for feature selection in classification tasks. The developed hybrid method is evaluated on fifteen datasets from the UCI Repository that are commonly used in classification problems. For a complete evaluation, the proposed hybrid feature selection method is compared with artificial bee colony optimization and differential evolution based feature selection methods, as well as with the three most popular feature selection techniques: information gain, chi-square, and correlation feature selection. In addition, the performance of the proposed method is compared with studies in the literature that use the same datasets. The experimental results show that the developed hybrid method selects good features for classification tasks, improving both run-time performance and classifier accuracy. The proposed hybrid method may also be applied to other search and optimization problems, as its feature selection performance is better than pure artificial bee colony optimization and differential evolution.

7.
刘波  蔡美  周绪川 《计算机科学》2016,43(1):232-236, 241
Databases and integration systems commonly face the problem of inconsistent queries that violate data constraints. Repair is one of the main remedies, but a unified model covering repair, constraints, and queries has been lacking. This paper proposes a consistent query algorithm based on tuple-deletion repair that satisfies multiple types of constraints; it specifies concise constraint definitions and query statement structures, and builds a new query-and-repair system model that unifies the relation instance set, a non-empty constraint set, query definitions, and repair methods, so as to produce query results that satisfy the consistency constraints. The proposed methods, language, and model are general-purpose and broadly applicable, not restricted to repairing and querying for specific quality problems.

8.
姜逸凡  叶青 《计算机应用》2019,39(4):1041-1045
In data mining tasks such as time series classification, class-based similarity behaves very differently across datasets, so a sound and effective similarity measure is crucial. Traditional measures such as Euclidean distance, cosine distance, and dynamic time warping compute similarity from the data alone, ignoring the influence of the label annotations that different datasets carry. To address this, a similarity metric learning method for time series based on Siamese neural networks (SNN) is proposed. The method learns neighborhood relations between data from the supervision in sample labels, establishing an efficient distance metric between time series. Similarity-measurement and classification experiments on the UCR time series datasets show that SNN clearly improves overall classification quality compared with ED/DTW-1NN. Although 1-nearest-neighbor (1NN) classification with dynamic time warping (DTW) outperforms SNN-based 1NN on some datasets, SNN beats DTW in the complexity and speed of the similarity computation during classification. The proposed method thus substantially improves the efficiency of similarity measurement and performs well on high-dimensional, complex time series classification.
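The supervision the abstract describes enters through training pairs: two series with the same label form a positive pair, two with different labels a negative pair, and the Siamese network is trained to pull positives together and push negatives apart. A minimal sketch of that pair construction (the `make_pairs` name is an assumption; the network and loss are omitted):

```python
from itertools import combinations

def make_pairs(X, y):
    """Build Siamese training pairs over dataset indices:
    (i, j, 1) if series i and j share a label, else (i, j, 0)."""
    return [(i, j, int(y[i] == y[j]))
            for i, j in combinations(range(len(X)), 2)]

pairs = make_pairs([[0.0, 0.1], [0.2, 0.1], [5.0, 5.2]], ["a", "a", "b"])
```

Each triple then feeds both branches of the shared-weight network with a contrastive loss; note the pair count grows quadratically, so real implementations usually sample pairs rather than enumerate them all.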

9.
Inductive Logic Programming (ILP) combines rule-based and statistical artificial intelligence methods by learning a hypothesis comprising a set of rules, given background knowledge and constraints for the search space. We focus on extending the XHAIL algorithm for ILP, which is based on Answer Set Programming, and evaluate our extensions on the Natural Language Processing task of sentence chunking. With respect to processing natural language, ILP can cater for the constant change in how we use language on a daily basis. At the same time, unlike other statistical methods, ILP does not require huge amounts of training examples, and it produces interpretable results, namely a set of rules that can be analysed and tweaked if necessary. As contributions we extend XHAIL with (i) a pruning mechanism within the hypothesis generalisation algorithm that enables learning from larger datasets, (ii) better use of modern solver technology through recently developed optimisation methods, and (iii) a time budget that permits the use of suboptimal results. We evaluate these improvements on the task of sentence chunking using three datasets from a recent SemEval competition. Results show that our improvements allow learning on bigger datasets with results of similar quality to state-of-the-art systems on the same task. Moreover, we compare the hypotheses obtained on the datasets to gain insights into the structure of each dataset.

10.
Forecasting the future values of a time series is a common research topic, studied with both probabilistic and non-probabilistic methods. Among probabilistic methods, the autoregressive integrated moving average and exponential smoothing methods are commonly used, whereas among non-probabilistic methods, artificial neural networks and fuzzy inference systems (FIS) are common. There are numerous FIS methods; while most are rule-based, a few do not require rules, such as the type-1 fuzzy function (T1FF) approach. While one can find an autoregressive (AR) model integrated with a T1FF, no method combining T1FF with the moving average (MA) model in one algorithm has yet been proposed. The aim of this study is to improve forecasting by taking the disturbance terms into account. The input dataset is organized as follows: first, the lagged values of the time series serve as the AR component; second, a fuzzy c-means clustering algorithm clusters the inputs; third, the residuals of the fuzzy functions serve as the MA component. Hence AR, MA, and the membership degrees of the objects are all included in the input dataset. Because the objective function is not differentiable, particle swarm optimization is preferable for solving it. Results on several datasets show that the proposed method outperforms most methods in the literature.
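One concrete building block of this pipeline is the fuzzy c-means membership computation that supplies the "degree of membership" inputs. The standard FCM membership formula for fixed cluster centers can be sketched as below (a simplified scalar version; the `fcm_memberships` name and fuzzifier default `m=2` are assumptions, and real FCM also iterates the centers):

```python
def fcm_memberships(xs, centers, m=2.0):
    """Fuzzy c-means membership degrees of scalar inputs to fixed centers:
    u_ic = 1 / sum_k (d_ic / d_ik)^(2/(m-1)); rows sum to 1."""
    U = []
    for x in xs:
        d = [abs(x - c) for c in centers]
        if 0.0 in d:
            # a point sitting exactly on a center belongs fully to it
            row = [1.0 if di == 0.0 else 0.0 for di in d]
        else:
            row = [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d)
                   for di in d]
        U.append(row)
    return U
```

These membership degrees, concatenated with the lagged values and residuals, form the regressor matrix the abstract describes.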

11.
Obtaining good clustering results requires choosing an algorithm suited to the dataset's cluster structure. This paper proposes a clustering algorithm selection method based on grid minimum spanning trees, which automatically selects a suitable clustering algorithm for a given dataset. The method first builds grid minimum spanning trees over the dataset, determines the dataset's latent cluster structure from the number of trees, and then selects a clustering algorithm that fits the discovered structure. Experimental results show the method is fairly effective at finding an algorithm that matches a given dataset's latent cluster structure.

12.
Evaluation and case study of reconstruction methods for time-series NDVI datasets    Cited by: 14 (self-citations: 1, others: 13)
Time-series NDVI datasets have been applied successfully to studies of global and regional environmental change, vegetation dynamics, land cover change, and the retrieval of vegetation biophysical parameters. Constrained by atmospheric conditions and sensor limitations, time-series NDVI data still contain considerable noise even after rigorous preprocessing, which hampers further applications. This paper first reviews six widely used reconstruction methods for time-series NDVI datasets: the modified best index slope extraction method, mean-value iteration filtering, Savitzky-Golay filtering, the Fourier transform method, asymmetric Gaussian function fitting, and harmonic analysis of time series. These methods are then applied to reconstruct the 10-day maximum-value-composite SPOT/VEGETATION NDVI time series over the Zhangye region for 2007 and 2008, and the reconstruction results are compared and evaluated. Finally, the methods are applied to artificially noised sequences, and the strengths and weaknesses of each reconstruction are assessed.
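Of the six methods surveyed, Savitzky-Golay filtering is the most compact to illustrate: each value is replaced by a local least-squares polynomial fit, which for a symmetric 5-point window and a quadratic fit reduces to the classic convolution weights (-3, 12, 17, 12, -3)/35. A minimal sketch (the `savgol5` name is an assumption, and real NDVI pipelines handle endpoints and iterate toward the upper envelope rather than copying endpoints as done here):

```python
def savgol5(y):
    """Savitzky-Golay smoothing: window 5, quadratic fit, using the
    standard coefficients (-3, 12, 17, 12, -3)/35; endpoints kept as-is."""
    c = (-3.0, 12.0, 17.0, 12.0, -3.0)
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(ck * y[i + k - 2] for k, ck in enumerate(c)) / 35.0
    return out
```

A useful property visible even in this sketch: any trend that is locally quadratic (including linear ramps and constants) passes through unchanged, while sharp noise spikes are damped — exactly the behavior wanted for cloud-contaminated NDVI dips.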

13.
To address the low quality of u-shapelet sets in u-shapelets-based time series clustering, a clustering algorithm based on optimal u-shapelets, DivUshapCluster, is proposed. First, the effect of different subsequence quality measures on u-shapelets-based clustering results is examined. The best measure is then used to score the u-shapelet candidate set. Next, a multivariate top-k query technique removes redundant candidates and searches out the optimal u-shapelet set. Finally, the optimal u-shapelet set transforms the original dataset so as to improve time series clustering accuracy. Experimental results show that DivUshapCluster not only outperforms classic time series clustering algorithms in accuracy, but also improves average clustering accuracy over 22 datasets by 18.80% and 19.38% compared with the BruteForce and SUSh algorithms, respectively. The proposed algorithm effectively improves time series clustering accuracy while maintaining overall efficiency.

14.
Automated program repair fixes software defects automatically and evaluates candidate patches with a test suite. However, because test suites are inadequate, a patch that passes them may not correctly fix the defect, and may even introduce new defects with ripple effects, so automated program repair produces many overfitting patches. To address this, this paper proposes an overfitting patch identification method based on data-flow analysis. It first decomposes a patch's modifications to the program into operations on variables, then uses data-flow analysis to identify the patch's impact domain, selects targeted coverage criteria for that domain to identify target coverage elements, and finally chooses test paths and generates test cases that thoroughly exercise the repaired program, avoiding the influence of repair side effects. Evaluation on two datasets shows that the data-flow-based overfitting patch identification method effectively improves the correctness of automated program repair.

15.
徐耀丽  李战怀  陈群  钟评 《软件学报》2016,27(7):1685-1699
Although various repair methods have been proposed for inconsistency in relational data, these strategies analyze only the attributes involved in functional dependencies (i.e., part of the dataset's information) when constructing the final repair, favor the minimum-cost repair, and ignore the dataset's other attributes and their correlations with the dependency attributes. This paper therefore proposes an inconsistency repair method based on the possible worlds model. It first constructs candidate repairs, then quantifies the credibility of each candidate from two aspects, repair cost and attribute-value correlation, and finally selects the optimal repair. Experimental results confirm that the proposed method achieves better repairs than existing cost-based methods. The effects of error rate and of different kinds of probability quantification on the proposed method are also analyzed.

16.
Tool breakage monitoring (TBM) during milling operations is crucial for ensuring workpiece quality and minimizing economic losses. Given sufficient training data with a balanced distribution, TBM methods based on statistical analysis and artificial intelligence can accurately recognize tool breakage conditions. However, in actual manufacturing, for safety reasons cutting tools usually work in normal wear conditions, and acquiring tool breakage signals is extremely difficult. This data imbalance problem seriously affects the recognition accuracy and robustness of TBM models. This paper proposes a TBM method based on the auxiliary classifier Wasserstein generative adversarial network with gradient penalty (ACWGAN-GP), approaching the problem from the perspective of data generation. By introducing the Wasserstein distance and gradient penalty terms into the loss function of ACGAN, ACWGAN-GP can generate multi-class fault samples while improving the network's stability during adversarial training. A sample filter based on multiple statistical indicators ensures the quality and diversity of the generated data. Qualified samples passing this quality assessment are added to the original imbalanced dataset to improve the tool breakage classifier's performance. Artificially controlled face milling experiments for TBM were carried out on a five-axis CNC machine to verify the effectiveness of the proposed method. Experimental results reveal that the proposed method outperforms other popular imbalanced fault diagnosis methods in data generation quality and TBM accuracy, and can meet the real-time requirements of TBM.

17.
邹朋成  王建东  杨国庆  张霞  王丽娜 《软件学报》2013,24(11):2642-2655
For time series clustering, an effective distance measure is crucial. To improve clustering performance, metric learning can be used to learn from the data a distance suited to time series clustering. However, existing metric learning ignores the characteristics of time series, and side information such as pairwise constraints is hard to obtain for time series data. This paper proposes SIADML (distance metric learning based on side information auto-generation for time series). Exploiting the strength of dynamic time warping (DTW) distance at capturing temporal characteristics, the method generates pairwise constraint information automatically so that the learned metric preserves, as far as possible, the intrinsic neighborhood relations between series. Experimental results on a suite of standard time series datasets show that the learned metric effectively improves time series clustering performance.
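The two ingredients named in this abstract — DTW distance and automatic constraint generation — can be sketched together. The DP below is the textbook DTW recurrence; the `auto_pairs` rule (closest k pairs become must-link, farthest k become cannot-link) is a plausible simplification of the paper's generation scheme, not its exact procedure.

```python
def dtw(a, b):
    """Classic dynamic time warping distance between two sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def auto_pairs(series, k):
    """Auto-generate side information: the k closest pairs under DTW
    become must-link constraints, the k farthest become cannot-link."""
    ranked = sorted(
        (dtw(series[i], series[j]), i, j)
        for i in range(len(series)) for j in range(i + 1, len(series)))
    must = [(i, j) for _, i, j in ranked[:k]]
    cannot = [(i, j) for _, i, j in ranked[-k:]]
    return must, cannot
```

The generated pairs then feed any constraint-based metric learner; because DTW tolerates temporal warping, the constraints respect shape similarity rather than pointwise alignment.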

18.
When optimizing big data processing and analysis algorithms, speed is often limited by dataset size: when datasets are too small, communication time exceeds true computation time, and the real performance cannot be verified. A big dataset generator was therefore designed and implemented to provide benchmark datasets for parallel big data processing and analysis algorithms running on supercomputers. First, a parallel random number generator was built using MPI parallel programming, and on top of it synthetic datasets of controllable size and complexity were implemented, mainly including classification and clustering datasets, regression datasets, manifold learning datasets, and factorization datasets. Second, an I/O system for the generator was designed, providing MPI-I/O interfaces for parallel reading and writing of datasets, with distribution and mapping rules for datasets across processes and point-to-point communication for data exchange between nodes. Experimental results show that the parallel big dataset generator effectively improves generation efficiency and scale, providing high-quality, large-volume test datasets for parallel big data processing and analysis algorithms.
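The key design point — each process generates its own shard deterministically from an independent random stream — can be sketched without MPI. Here each "rank" derives its seed from `(seed, rank)` and owns a contiguous slice of the global index range; the `make_blobs` name, the Gaussian-blob layout, and the seeding rule are illustrative assumptions, not the paper's actual generator.

```python
import random

def make_blobs(n, centers, spread, rank=0, nprocs=1, seed=42):
    """Generate this rank's shard of an n-point clustering dataset.
    Each rank seeds its own stream, so shards are independent and the
    whole dataset is reproducible without any inter-rank communication."""
    rng = random.Random(seed * 100003 + rank)        # per-rank stream
    lo = rank * n // nprocs                           # this rank's slice
    hi = (rank + 1) * n // nprocs
    data = []
    for i in range(lo, hi):
        cx, cy = centers[i % len(centers)]            # round-robin cluster
        data.append((cx + rng.gauss(0.0, spread),
                     cy + rng.gauss(0.0, spread)))
    return data
```

In the MPI version, rank r would call this with `rank=r, nprocs=comm.size` and then write its shard through MPI-I/O at offset `lo`; no rank ever needs another rank's data.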

19.
Today's e-commerce is highly dependent on the ever-growing volume of online customer reviews posted on opinion sharing websites. This fact, unfortunately, has tempted spammers to target opinion sharing websites in order to promote or demote products. To date, different types of opinion spam detection methods have been proposed in order to provide reliable resources for customers, manufacturers, and researchers. However, supervised approaches suffer from imbalanced data due to the scarcity of spam reviews in datasets, rating-deviation-based filtering systems are easily cheated by smart spammers, and content-based methods are very expensive, with the majority not yet tested on real data.

The aim of this paper is to propose a robust review spam detection system in which rating deviation, content-based factors, and reviewer activeness are employed efficiently. To overcome the aforementioned drawbacks, all these factors are investigated jointly within suspicious time intervals captured from the time series of reviews by a pattern recognition technique. The proposed method could be a great asset in online spam filtering systems and could be used in data mining and knowledge discovery tasks as a standalone system to purify product review datasets; such systems benefit from the method in both time efficiency and accuracy. Empirical analyses on a real dataset show that the proposed approach successfully detects spam reviews. Comparison with two current common methods indicates that our method achieves higher detection accuracy (F-score: 0.86) while removing the need for specific metadata fields and reducing the heavy computation required for investigation.

20.
Clustering of traffic data based on correlation analysis is an important element of several network management objectives, including traffic shaping and quality of service control. Existing correlation-based clustering algorithms yield poor results when applied to the highly variable time series that characterize most network traffic data. This paper proposes a new similarity measure for computing clusters of highly variable data on the basis of their correlation. Experimental evaluations on several synthetic and real datasets show the accuracy and robustness of the proposed solution, which improves on existing clustering methods based on statistical correlations.
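The baseline that correlation-based clustering builds on is a dissimilarity derived from Pearson correlation, which ignores level and scale shifts and responds only to co-movement. A minimal sketch of that baseline (the paper proposes a refined measure on top of this idea, which is not reproduced here):

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def corr_dist(a, b):
    """Correlation-based dissimilarity in [0, 2]: 0 for perfectly
    co-moving series, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(a, b)
```

Plugging `corr_dist` into any distance-based clusterer gives correlation clustering; the paper's contribution is a measure that remains stable where this plain version degrades on highly variable series.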
