Similar Literature
1.
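曲武, 王莉军, 韩晓光. 《计算机科学》 (Computer Science), 2014, 41(11): 195-202
In recent years, the widespread use of computing and information-processing technology in industrial production and other fields has produced a continuous flow of sequence data that evolves over time, forming time-series data streams. Examples include Internet news corpus analysis, network intrusion detection, stock market analysis, and sensor network data analysis. Real-time data stream clustering is currently a hot topic in data stream mining. Single-pass algorithms meet the demands of high-speed, large-scale streams and real-time analysis, but they lack effective clustering methods for identifying and distinguishing patterns, which limits their effectiveness and scalability. To address these problems, this paper proposes DLCStream, an LSH-based distributed data stream clustering algorithm for cloud environments. By introducing the Map-Reduce framework and locality-sensitive hashing (LSH), DLCStream can quickly find cluster patterns in a data stream. Detailed theoretical analysis and experiments show that, compared with the traditional data stream clustering framework CluStream, DLCStream has advantages in efficient parallel processing, scalability, and clustering quality.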
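
The abstract above does not say which hash family DLCStream uses. As a minimal sketch of locality-sensitive hashing in general, the example below assumes random-hyperplane (sign) hashing, where points separated by a small angle tend to receive the same bit signature. The function names (`make_hyperplanes`, `lsh_signature`, `bucket_points`) are hypothetical and are not taken from the paper.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals (Gaussian) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(point, hyperplanes):
    """One bit per hyperplane: 1 if the point lies on its non-negative side."""
    return tuple(
        1 if sum(p * h for p, h in zip(point, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_points(points, hyperplanes):
    """Group point indices by signature; similar points tend to collide."""
    buckets = {}
    for index, point in enumerate(points):
        buckets.setdefault(lsh_signature(point, hyperplanes), []).append(index)
    return buckets
```

In a Map-Reduce setting, a signature like this could plausibly serve as the map key so that candidate neighbors reach the same reducer. That mapping is an assumption of this sketch, not a description of DLCStream.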

2.
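Knowledge acquisition and knowledge base updating are bottlenecks in applying case-based reasoning (CBR), yet the knowledge bases in many CBR systems are static and cannot keep up with changing real-world problems. This paper first introduces the relevant concepts and then proposes a CBR model based on dynamic data stream mining, in which the mining algorithm is an improved data stream clustering algorithm. The model mines data in real time to produce a continuous, dynamic temporary case base, so that the knowledge base is updated in real time and can keep up with changing real-world problems. Finally, a practical application of the model illustrates its effectiveness.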

3.
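A Data Stream Clustering Algorithm Based on Particle Swarm Optimization   Cited by: 1 (self-citations: 0, citations by others: 1)
肖裕权, 周肆清. 《微机发展》 (Microcomputer Development), 2011, (10): 43-46, 50
To address the loss of original data information in current sliding-window clustering algorithms and to improve clustering quality and accuracy, this paper proposes a new data stream clustering algorithm that builds on existing sliding-window data stream clustering and uses particle swarm optimization (PSO), a method based on swarm cooperation. The algorithm uses an improved exponential histogram of temporal cluster features as the synopsis structure of the data stream and applies PSO to locally and iteratively optimize clustering quality during the clustering process. Experimental results show that the method effectively reduces memory overhead and resolves the loss of original data information, and that it clearly outperforms traditional data stream clustering algorithms in clustering quality and accuracy.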

4.
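A Data Stream Clustering Algorithm Based on Density and Affinity Propagation   Cited by: 1 (self-citations: 0, citations by others: 1)
Existing algorithms suffer from low clustering accuracy, handle outliers poorly, and cannot detect changes in a data stream in real time. To address these shortcomings, this paper proposes a data stream clustering algorithm that combines density with affinity propagation. The algorithm adopts a two-phase online/offline processing framework. It introduces a decaying density for micro-clusters to accurately reflect the evolution of the data stream, and it maintains and prunes micro-clusters dynamically online so that the model better matches the intrinsic characteristics of the original stream. When a new class pattern is detected in the model, an improved weighted affinity propagation clustering algorithm (Weighted and hierarchical affinity propagation, WAP) is used to rebuild the model, so the algorithm can detect changes in the data stream in real time and provide clustering results at any time. Experiments on real and synthetic data sets show that the algorithm has good applicability, effectiveness, and scalability and achieves good clustering results.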
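
The abstract does not give the decay formula for micro-cluster density. The sketch below assumes the exponential fading function 2^(-λ·Δt) that is common in density-based stream clustering. The class `MicroCluster`, the function `prune`, and the parameters `decay_rate` and `min_weight` are hypothetical names chosen for illustration.

```python
class MicroCluster:
    """A micro-cluster whose weight fades exponentially with elapsed time."""

    def __init__(self, point, timestamp, decay_rate=0.01):
        self.decay_rate = decay_rate      # assumed fading parameter (lambda)
        self.weight = 1.0                 # decayed count of absorbed points
        self.linear_sum = list(point)     # decayed per-dimension sum
        self.last_update = timestamp

    def _fade(self, timestamp):
        """Apply the decay accumulated since the last update."""
        factor = 2.0 ** (-self.decay_rate * (timestamp - self.last_update))
        self.weight *= factor
        self.linear_sum = [s * factor for s in self.linear_sum]
        self.last_update = timestamp

    def insert(self, point, timestamp):
        """Fade the existing statistics, then absorb the new point."""
        self._fade(timestamp)
        self.weight += 1.0
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, point)]

    def weight_at(self, timestamp):
        """Decayed weight at a later time, without modifying the cluster."""
        elapsed = timestamp - self.last_update
        return self.weight * 2.0 ** (-self.decay_rate * elapsed)

    def center(self):
        return [s / self.weight for s in self.linear_sum]


def prune(micro_clusters, timestamp, min_weight):
    """Keep only micro-clusters whose decayed weight is still large enough."""
    return [mc for mc in micro_clusters if mc.weight_at(timestamp) >= min_weight]
```

Under these assumptions, a micro-cluster that stops receiving points loses weight steadily and is eventually dropped by `prune`, which is one simple way to realize online pruning.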

5.
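Real-time data stream clustering is an emerging research hotspot in the international database and data management community. This paper surveys recent progress in real-time data stream clustering. After introducing the relevant theory and common techniques, it systematically analyzes the strengths and weaknesses of representative existing algorithms and compares their performance in five respects: processing speed, cluster shape, evolution analysis, high dimensionality, and robustness to noise. It also discusses clustering-based methods for analyzing the evolution of real-time data streams and their limitations, and finally outlines possible directions for future research.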

6.
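A Data Stream Clustering Algorithm Based on Immune Principles   Cited by: 1 (self-citations: 0, citations by others: 1)
Because immune-based learning methods adapt well to the constant change and high-speed processing requirements of data streams, this paper proposes a data stream clustering algorithm based on immune principles (AIN-STREAM). The algorithm adapts dynamically to changes in the data stream and effectively suppresses noise. By building and maintaining B-cell feature vectors, AIN-STREAM automatically adjusts the recognition regions of B cells according to user requirements, which keeps the clustering results stable. Theoretical analysis and experimental results show that, for comparable clustering results, AIN-STREAM is more time- and space-efficient than comparable algorithms while also achieving high clustering accuracy.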

7.
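An Evolutionary Data Stream Clustering Algorithm Integrating Affinity Propagation and Density   Cited by: 3 (self-citations: 0, citations by others: 3)
邢长征, 刘剑. 《计算机应用》 (Journal of Computer Applications), 2015, 35(7): 1927-1932
Current methods handle outliers in data streams poorly, cluster data streams inefficiently, and cannot detect dynamic changes in a stream in real time. To address these problems, this paper proposes an evolutionary data stream clustering algorithm that integrates affinity propagation and density (I-APDenStream). The algorithm uses the traditional two-phase processing model, consisting of online and offline clustering. It introduces a micro-cluster decay density that reflects the dynamic changes of the data stream and a pruning mechanism for maintaining micro-clusters dynamically online, and it also adds an outlier detection and removal mechanism when the model is rebuilt with extended weighted affinity propagation (WAP) clustering. Experiments on two types of data sets show that the clustering accuracy of the proposed algorithm generally stays above 95%, that purity comparisons and other related tests also give good results, and that the algorithm clusters data streams with high timeliness, quality, and efficiency.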

8.
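To improve the intelligent management and information scheduling capabilities of customer service centers, this paper designs real-time data monitoring and automatic data collection for customer service centers using big data analysis. It proposes an automatic monitoring method for real-time customer service center data streams based on fuzzy-rule feature mining and hierarchical clustering. A grid distribution model of the customer service center is built, and the statistical features of real-time data stream monitoring are analyzed. The attribute set of the real-time monitoring data is decomposed into vector-quantized features, information fusion is performed on the real-time data using fuzzy hierarchical analysis, associated data features are extracted adaptively, and positively correlated features of the real-time monitoring information flow are mined. On top of a hierarchical clustering algorithm, autoregressive analysis is used for fuzzy clustering and information prediction of the real-time data stream, which improves monitoring accuracy and reduces the risk in data stream monitoring. Simulation results show that the method achieves good clustering and prediction performance, lowers the misclassification rate of data clustering, and improves the real-time data monitoring capability of customer service centers.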

9.
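The main goal of subspace clustering of data streams is to accurately find clusters in the feature subspaces of a data stream within a reasonable time. Existing subspace clustering algorithms for data streams are heavily influenced by their parameters, usually require the number of clusters or the feature subspaces to be given in advance, and produce results that do not promptly reflect changes in the stream. To address these shortcomings, this paper proposes a new subspace clustering algorithm for data streams, SC-RP. SC-RP does not require the number of clusters or the feature subspaces in advance, is insensitive to outliers, and clusters quickly. It uses a region tree structure to record changes in the data stream and update statistics promptly, and then adjusts the clustering results according to those changes. Experiments on real and simulated data sets show that SC-RP outperforms existing subspace clustering algorithms for data streams in clustering accuracy and speed, and that it scales well with both the number of clusters and the data dimensionality.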

10.
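陈崚, 邹凌君, 屠莉. 《计算机应用》 (Journal of Computer Applications), 2007, 27(8): 1976-1979
Current algorithms for clustering multiple data streams cannot balance quality and efficiency. To address this, this paper proposes a correlation-coefficient-based algorithm for clustering multiple data streams that performs online dynamic clustering over a fixed length. The algorithm introduces a decay coefficient to improve clustering quality, uses the correlation coefficient as the similarity measure between streams, divides each data stream into several segments, and clusters on the correlation statistics of the streams to obtain a real-time cluster structure. Experimental results show that the algorithm offers high efficiency, clustering quality, and stability.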
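
The abstract names two ingredients, a correlation coefficient as the similarity measure and a decay coefficient, but it does not spell out how they are combined. One plausible reading, sketched below, keeps exponentially decayed sufficient statistics for a pair of streams and derives a weighted Pearson correlation from them. The class name `DecayedCorrelation` and the `decay` parameter are assumptions made for illustration.

```python
import math


class DecayedCorrelation:
    """Pearson correlation of two streams from decayed running sums."""

    def __init__(self, decay=0.99):
        self.decay = decay  # assumed per-step decay coefficient in (0, 1]
        self.n = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        """Down-weight the old statistics, then add the new pair (x, y)."""
        d = self.decay
        self.n = d * self.n + 1.0
        self.sx = d * self.sx + x
        self.sy = d * self.sy + y
        self.sxx = d * self.sxx + x * x
        self.syy = d * self.syy + y * y
        self.sxy = d * self.sxy + x * y

    def correlation(self):
        """Weighted correlation; 0.0 if either stream has no variance yet."""
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        if var_x <= 0.0 or var_y <= 0.0:
            return 0.0
        return (self.n * self.sxy - self.sx * self.sy) / math.sqrt(var_x * var_y)
```

A clustering step could then use a dissimilarity such as `1 - correlation()` between streams. Because only six running sums are stored per pair, the statistics can be updated in constant time per arriving value.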

11.
On change diagnosis in evolving data streams   Cited by: 1 (self-citations: 0, citations by others: 1)
In recent years, progress in hardware technology has made it possible for organizations to store and record large streams of transactional data. This results in databases which grow without limit at a rapid rate. This data can often show important changes in trends over time. In such cases, it is useful to understand, visualize, and diagnose the evolution of these trends. In this paper, we introduce the concept of velocity density estimation, a technique used to understand, visualize, and determine trends in the evolution of fast data streams. We show how to use velocity density estimation in order to create both temporal velocity profiles and spatial velocity profiles at periodic instants in time. These profiles are then used in order to predict three kinds of data evolution: dissolution, coagulation, and shift. Methods are proposed to visualize the changing data trends in a single online scan of the data stream, with a computational requirement which is linear in the number of data points. The visualization techniques can also be used to provide online animations which show the changes in the data characteristics while they occur. In addition, batch processing techniques are proposed in order to quantify the level of change across different combinations of dimensions. This quantification is then used in order to determine dimensional combinations with significant evolution. The techniques discussed in this paper can be easily extended to spatiotemporal data, changes in data snapshots at fixed instances in time, or any other data which has a temporal component during its evolution.

12.
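Research on High-Dimensional Data Stream Clustering and Its Evolution Analysis   Cited by: 5 (self-citations: 0, citations by others: 5)
Clustering analysis of data streams has become a hot research topic. This paper proposes CAStream, a subspace-based algorithm for clustering high-dimensional data streams and analyzing their evolution. The algorithm partitions the data space into a grid, records the statistics of grid cells approximately, stores snapshots of potentially dense grid cells in an improved pyramidal time frame, and finally uses depth-first search for clustering and evolution analysis. CAStream handles high-dimensional data streams effectively and can discover clusters of arbitrary shape. Experiments on real and simulated data sets show that the algorithm has good applicability and effectiveness.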

13.
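To improve the clustering quality of evolving data streams, this paper proposes a data stream clustering algorithm based on semi-supervised affinity propagation (SAPStream). Drawing on semi-supervised clustering, the algorithm constructs a similarity matrix for the initial data stream and applies affinity propagation clustering to build an online clustering model. As the data stream evolves, a damped window technique is used to adjust the clustering model in time, and the resulting class exemplars are clustered again together with newly arrived data points to obtain the clustering result for the stream. Experiments on dynamic clustering of data streams show that the algorithm is effective and produces high-quality results.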

14.
Data streams have become ubiquitous in recent years because of advances in hardware technology which have enabled automated recording of large amounts of data. The primary constraint in the effective mining of streams is the large volume of data which must be processed in real time. In many cases, it is desirable to store a summary of the data stream segments in order to perform data mining tasks. Since density estimation provides a comprehensive overview of the probabilistic data distribution of a stream segment, it is a natural choice for this purpose. A direct use of density distributions can, however, turn out to be an inefficient storage and processing mechanism in practice. In this paper, we introduce the concept of cluster histograms, which provides an efficient way to estimate and summarize the most important data distribution profiles over different stream segments. These profiles can be constructed in a supervised or unsupervised way depending upon the nature of the underlying application. The profiles can also be used for change detection, anomaly detection, segmental nearest neighbor search, or supervised stream segment classification. Furthermore, these techniques can also be used for modeling other kinds of data such as text and categorical data. The flexibility of the tasks which can be performed from the cluster histogram framework follows from its generality in storing the historical density profile of the data stream. As a result, this method provides a holistic framework for density-based mining of data streams. We discuss and test the application of the cluster histogram framework to a variety of interesting data mining applications.

15.
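Building on a study of existing prediction methods for time-series data streams, this paper presents a general sliding-window prediction model for time-series data streams and proposes a new method, Online-HHT, that reduces noise effectively, performs multi-scale sliding-window analysis, and then makes predictions. It combines the sliding-window technique for data streams with the HHT method to enable online analysis. Experiments with this model confirm that Online-HHT can effectively perform online, adaptive trend prediction on time-series data streams.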

16.
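陈小东, 孙力娟, 韩崇, 郭剑. 《计算机科学》 (Computer Science), 2016, 43(4): 219-223, 251
To handle concept drift that may occur in data streams, this paper uses an improved FCM algorithm for fuzzy clustering and proposes detecting concept drift within a variable-size sliding window by measuring the difference between adjacent windows; a corresponding method for handling the drift is also given. Experiments show that the algorithm effectively detects concept drift in data streams and offers good clustering results and high time efficiency.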

17.
Performing data mining tasks on streaming data is considered a challenging research direction because the data evolve continuously. In this work, we focus on the problem of clustering streaming time series, based on the sliding window paradigm. More specifically, we use the concept of subspace α-clusters. A subspace α-cluster consists of a set of streams whose value difference is less than α over a number of consecutive time instances (dimensions). The clusters can be continuously and incrementally updated as the streaming time series evolve with time. The proposed technique is based on a careful examination of pair-wise stream similarities for a subset of dimensions and is then generalized to more streams per cluster. Additionally, we extend our technique in order to find maximal pClusters in consecutive dimensions that have been used in previously proposed clustering methods. Performance evaluation results, based on real-life and synthetic data sets, show that the proposed method is more efficient than existing techniques. Moreover, it is shown that the proposed pruning criteria are very important for search space reduction, and that incremental cluster monitoring is computationally cheaper than re-clustering.
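
The pairwise step described above can be illustrated with a small sketch. For two streams observed over the same window, it finds the maximal runs of consecutive time instances in which their values differ by less than α. The function name `alpha_runs` and the `min_length` parameter are hypothetical, and the sketch covers only the pairwise check, not the generalization to larger clusters or the pruning criteria.

```python
def alpha_runs(stream_a, stream_b, alpha, min_length=1):
    """Return (start, end) index pairs of maximal runs where |a - b| < alpha.

    Indices are inclusive and refer to positions in the shared window.
    Runs shorter than min_length are discarded.
    """
    length = min(len(stream_a), len(stream_b))
    runs = []
    start = None  # start index of the run currently being extended
    for t in range(length):
        if abs(stream_a[t] - stream_b[t]) < alpha:
            if start is None:
                start = t
        elif start is not None:
            if t - start >= min_length:
                runs.append((start, t - 1))
            start = None
    # Close a run that extends to the end of the window.
    if start is not None and length - start >= min_length:
        runs.append((start, length - 1))
    return runs
```

Two streams would be candidates for the same α-cluster on any returned run. An incremental version would extend or close the last run as each new value arrives rather than rescanning the window.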

18.
Many database applications require efficient processing of data streams with value variations and fluctuating sampling frequency. The variations typically imply fundamental features of the stream and important domain knowledge of the underlying objects. In some data streams, successive events seem to recur in a certain time interval, but the data in fact evolves with tiny differences as time elapses. This feature, called pseudo periodicity, poses a new challenge to stream variation management. This study focuses on the online management of variations over such streams. The idea can be applied to many scenarios, such as patient vital signal monitoring in medical applications. This paper proposes a new method named Pattern Growth Graph (PGG) to detect and manage variations over evolving streams, with the following features: 1) it adopts the wave-pattern to capture the major information of data evolution and represent it compactly; 2) it detects the variations in a single pass over the stream with the help of a wave-pattern matching algorithm; 3) it only stores different segments of the pattern for the incoming stream, and hence substantially compresses the data without losing important information; 4) it distinguishes meaningful data changes from noise and reconstructs the stream with acceptable accuracy. Extensive experiments on real datasets containing millions of data items, as well as a prototype system, are carried out to demonstrate the feasibility and effectiveness of the proposed scheme.

19.
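This paper analyzes the principles and implementation of PCA and KPCA, two dimensionality reduction algorithms for data streams. Because linear dimensionality reduction with PCA is ineffective on large data sets and KPCA is inefficient, a new dimensionality reduction strategy, the GKPCA algorithm, is proposed. The algorithm first divides the data set into groups and runs KPCA on each group, then filters and recombines the data set and applies KPCA again, which simplifies the sample space and reduces time and space complexity. Experimental analysis shows that GKPCA not only achieves good dimensionality reduction but also takes little time.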

20.
The data generation rate of a data stream is usually unpredictable, and some data elements of the stream cannot be processed in real time if the generation rate exceeds the capacity of the stream processing algorithm. To handle this situation gracefully, a load shedding technique is recommended. This paper proposes a frequency-based load shedding technique over a data stream of tuples. In many data stream processing applications, such as mining frequent patterns, data elements having high frequency can be considered more significant than others having low frequency. Based on this observation, in the proposed technique, only frequent elements of a data stream are processed in real time while the others are trimmed. The decision to shed load from the data stream is controlled automatically by the data generation rate of the stream. Consequently, the proposed technique does not perform unnecessary load shedding.
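
The idea in this abstract can be sketched as follows, under simplifying assumptions that are not taken from the paper: the stream arrives in batches, "capacity" is the number of elements that can be processed per batch, and an element counts as frequent once its running count reaches a fixed threshold. The class `FrequencyLoadShedder` and its parameters are hypothetical.

```python
from collections import Counter


class FrequencyLoadShedder:
    """Shed infrequent elements only when a batch exceeds capacity."""

    def __init__(self, capacity, min_count):
        self.capacity = capacity    # assumed: elements processable per batch
        self.min_count = min_count  # assumed: count needed to be "frequent"
        self.counts = Counter()     # running frequency of every element seen

    def filter_batch(self, batch):
        """Return the elements of the batch that should be processed."""
        self.counts.update(batch)
        if len(batch) <= self.capacity:
            # Arrival rate is within capacity, so nothing is shed.
            return list(batch)
        # Overloaded: keep only elements that are currently frequent.
        return [item for item in batch if self.counts[item] >= self.min_count]
```

The batch size stands in for the data generation rate here, so shedding switches on and off automatically with the load. This sketch does not guarantee that the filtered batch fits within capacity; a fuller version might raise the threshold until it does.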
