Similar literature
1.
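Personalized service sites based on Web log mining   Total citations: 2, self-citations: 1, citations by others: 2
This paper introduces the concept of a personalized site and analyzes the architecture of a Web log mining system. It then applies association rule mining to log transaction sessions and, based on an analysis of the characteristics of log data, proposes an Apriori-like mining algorithm. It presents a method, which the authors describe as the most effective, for extracting association rules from the frequent itemsets that the Apriori-like algorithm produces. Finally, it discusses how to choose a suitable rule in practice when several association rules match.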
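The two steps this abstract names, mining frequent itemsets from sessions and then deriving rules from them, can be sketched as follows. This is a textbook Apriori on toy data, not the paper's Apriori-like variant, whose details the abstract does not give; the function names `apriori` and `extract_rules` and the page paths are assumptions made for illustration.

```python
from itertools import combinations


def apriori(sessions, min_support):
    """Return {itemset: support count} for every frequent itemset (textbook Apriori)."""
    sessions = [frozenset(s) for s in sessions]
    # Level 1: count single pages.
    counts = {}
    for s in sessions:
        for item in s:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {k: v for k, v in counts.items() if v >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates,
        # keeping only those whose (k-1)-subsets are all frequent (prune step).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for s in sessions if c <= s) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def extract_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs with confidence = support(itemset) / support(lhs)."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so the lookup is safe.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules
```

With four toy sessions over `/home`, `/news`, and `/sports`, a minimum support of 2, and a minimum confidence of 0.9, the only rule that survives is `/sports -> /home`.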

2.
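Association rule mining is one of the most common and important data mining tasks; classic algorithms include Apriori, FP-Growth, and Eclat. With the explosive growth of data, traditional algorithms can no longer meet the needs of big data mining, and distributed, parallel association rule mining algorithms are required. MapReduce is a popular distributed parallel computing model that is widely used because it is simple to use, scales well, and provides automatic load balancing and fault tolerance. This paper classifies and surveys existing parallel association rule mining algorithms based on the MapReduce model, summarizes their respective strengths, weaknesses, and ranges of application, and outlines directions for future research.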
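The counting step that MapReduce-based association rule miners parallelize can be illustrated with a single-process simulation. This is only a sketch of the map and reduce pattern, not any specific algorithm from the survey and not real Hadoop code; `map_partition`, `reduce_counts`, and `frequent_k_itemsets` are hypothetical names.

```python
from collections import Counter
from functools import reduce
from itertools import combinations


def map_partition(transactions, k):
    """Map phase: count every k-item subset seen in one data partition."""
    local = Counter()
    for t in transactions:
        for combo in combinations(sorted(set(t)), k):
            local[combo] += 1
    return local


def reduce_counts(a, b):
    """Reduce phase: merge two partial count tables by summing per key."""
    return a + b


def frequent_k_itemsets(partitions, k, min_support):
    """Sum local counts from all partitions, then apply the global support threshold."""
    total = reduce(reduce_counts, (map_partition(p, k) for p in partitions), Counter())
    return {items: n for items, n in total.items() if n >= min_support}
```

Each partition is processed independently, so in a real deployment the map calls could run on different machines; only the small count tables need to be moved for the reduce step.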

3.
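张琰 (Zhang Yan), 《网友世界》, 2012, (12): 4-6
Data mining techniques extract potentially useful information and knowledge from large volumes of fuzzy, incomplete, and random data. After more than 20 years of development, data mining has achieved a great deal. Web mining is an important branch of data mining that emerged alongside advances in artificial intelligence, database, and network technologies. Focusing on the characteristics of Web log mining, this paper studies association rule mining algorithms in depth and systematically discusses their application to Web log mining. It uses an optimized Apriori algorithm to improve efficiency.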

4.
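With the rapid development of database technology, the volume of stored data grows day by day, making data mining increasingly important; association rule mining is one of the most active research methods in data mining. This paper first reviews research on association rule mining, then proposes and implements an effective matrix-based improvement of the Apriori algorithm, and finally discusses and implements an application of Apriori in commerce.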
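One common way to make Apriori matrix-based is to store the transactions as a Boolean matrix and compute support by AND-ing item columns, so the database is scanned only once. The sketch below assumes that general idea; the abstract does not describe the paper's actual matrix scheme, and `build_columns` and `support` are hypothetical names. Each column is packed into a Python integer used as a bitset.

```python
def build_columns(transactions):
    """Build one bitset per item: bit i is set when transaction i contains the item."""
    columns = {}
    for row, t in enumerate(transactions):
        for item in set(t):
            columns[item] = columns.get(item, 0) | (1 << row)
    return columns


def support(columns, itemset, n_rows):
    """Support of an itemset = number of rows where all of its columns are 1."""
    mask = (1 << n_rows) - 1  # start with every transaction selected
    for item in itemset:
        mask &= columns.get(item, 0)  # unknown items have an all-zero column
    return bin(mask).count("1")
```

After `build_columns` has run once, every candidate's support is a handful of bitwise ANDs plus a bit count, with no further pass over the raw transactions.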

5.
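Building on existing grid and data mining technologies, this paper studies and analyzes the overall framework of the knowledge grid and the semantic learning process within that framework. Based on the idea of semantic learning on the knowledge grid platform, it improves the classic association rule algorithm Apriori so that it is suitable for distributed association rule mining on that platform, and it analyzes, tests, and evaluates the improved algorithm.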

6.
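Research on the application of association rule mining in coal mine safety early-warning systems   Total citations: 1, self-citations: 0, citations by others: 1
To address the multi-source, heterogeneous nature of coal mine safety early-warning data, this article proposes a design that applies association rule mining to the parameters of a coal mine safety early-warning system. It gives the association rule mining model and an analysis of its algorithm, and describes in detail the application and implementation of the Apriori algorithm in such a system. Simulation results indicate that the scheme performs well and is an effective method for comprehensive coal mine safety evaluation.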

7.
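To meet the growing demand for mining massive data, a distributed association rule mining algorithm that can run on multiple machines is urgently needed. When a highly iterative algorithm such as Apriori runs on Hadoop, each iteration performs a large amount of disk I/O, which severely limits its efficiency. Taking advantage of Spark's built-in support for distributed computing, this paper designs and implements a distributed association rule mining algorithm on Spark, called Staged Adaptive Apriori. The algorithm mines frequent itemsets efficiently using an adaptive strategy that processes part of the dataset at a time; before each iteration it makes a preliminary estimate of execution time and chooses the more suitable method to reduce time and space complexity. It is thus an adaptive association rule mining algorithm driven by the properties of the dataset. Experimental results show the effectiveness of the algorithm.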

8.
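With the continued development and widespread use of information technology, large amounts of data are constantly collected and stored, and data mining tasks over distributed target data keep growing in scale. Traditional data mining cannot handle distributed mining of massive data, and distributed systems have difficulty coping with heterogeneous operating systems and protocols. As grid technology matures, using the powerful resource sharing of heterogeneous virtual organizations in a grid environment to perform collaborative parallel data mining has become a research focus of grid applications. This paper proposes a parallel association rule mining scheme for grid environments based on Agent technology, multithreading, and centralized voting, and validates it experimentally under GT4, achieving distributed parallel mining of large-scale data in a grid environment.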

9.
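This paper first introduces the background and significance of privacy-preserving data mining methods, then summarizes the current state of research on privacy-preserving data mining algorithms in China and abroad. It classifies the algorithms proposed so far by data mining method, distribution of data sources, privacy-preservation technique, object of privacy protection, and type of data mining application. It then describes in detail some typical techniques and algorithms for privacy-preserving association rule mining, classification, and clustering under centralized and distributed data settings, summarizes their strengths and weaknesses, and analyzes and compares them. Finally, it points out overall directions for the future development of privacy-preserving data mining algorithms.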

10.
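Research on a fast update algorithm for association rules after clustering*   Total citations: 1, self-citations: 0, citations by others: 1
Association rules and cluster analysis are important research topics in data mining. By analyzing the Apriori association rule mining algorithm, this paper identifies two main problems it has in practical use. In view of these, and based on an analysis of the two mining approaches, clustering and association rules, it discusses a joint mining approach that integrates the two otherwise independent methods so that the data size can be reduced effectively. It then describes a fast update algorithm for association rules after clustering. Experimental results indicate that the algorithm performs well and improves the execution efficiency of data mining.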

11.
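To overcome the shortcomings of traditional and distributed data mining algorithms, this paper proposes a data mining algorithm based on a grid platform and adapts the original Apriori algorithm to run on that platform. Grid-based data mining algorithms offer pooled computing power, security, efficiency, and savings in hardware cost, and are attracting growing attention from academia.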

12.
Distributed data mining implements techniques for analyzing data on distributed computing systems by exploiting data distribution and parallel algorithms. The grid is a computing infrastructure for implementing distributed high-performance applications and solving complex problems, offering effective support to the implementation and use of data mining and knowledge discovery systems. The Web Services Resource Framework has become the standard for the implementation of grid services and applications, and it can be exploited for developing high-level services for distributed data mining applications. This paper describes how distributed data mining patterns, such as collective learning, ensemble learning, and meta-learning models, can be implemented as Web Services Resource Framework mining services by exploiting the grid infrastructure. The goal of this work was to design a distributed architectural model that can be exploited for different distributed mining patterns deployed as grid services for the analysis of dispersed data sources. To validate this approach, we also present the implementation of two clustering algorithms on the developed architecture. In particular, distributed k-means and distributed expectation maximization were used as pilot examples to show the suitability of the implemented service-oriented framework. An extensive evaluation of its performance is provided. Copyright © 2011 John Wiley & Sons, Ltd.
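The distributed k-means pattern mentioned in this abstract is usually built on one observation: each site only needs to send per-cluster sums and counts, and a coordinator can merge them into exact global centroids. The sketch below shows that pattern in a single process. It is an assumption about the general technique, not the paper's WSRF service implementation; `local_stats`, `merge_and_update`, and `distributed_kmeans` are hypothetical names.

```python
def local_stats(points, centroids):
    """Run on one site: per-centroid coordinate sums and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        j = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(all_stats, centroids):
    """Run on the coordinator: combine site statistics into new centroids."""
    dim = len(centroids[0])
    updated = []
    for j, old in enumerate(centroids):
        n = sum(counts[j] for _, counts in all_stats)
        if n == 0:
            updated.append(list(old))  # keep an empty cluster where it was
        else:
            updated.append(
                [sum(sums[j][d] for sums, _ in all_stats) / n for d in range(dim)]
            )
    return updated


def distributed_kmeans(sites, centroids, iterations=10):
    """Alternate local statistics and global merge for a fixed number of rounds."""
    for _ in range(iterations):
        stats = [local_stats(points, centroids) for points in sites]
        centroids = merge_and_update(stats, centroids)
    return centroids
```

Because only sums and counts leave each site, the raw points never have to be moved, and the result matches what centralized k-means would compute from the same starting centroids.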

13.
The service-oriented architecture paradigm can be exploited for the implementation of data and knowledge-based applications in distributed environments. The Web services resource framework (WSRF) has recently emerged as the standard for the implementation of Grid services and applications. WSRF can be exploited for developing high-level services for distributed data mining applications. This paper describes Weka4WS, a framework that extends the widely used open source Weka toolkit to support distributed data mining on WSRF-enabled Grids. Weka4WS adopts the WSRF technology for running remote data mining algorithms and managing distributed computations. The Weka4WS user interface supports the execution of both local and remote data mining tasks. On every computing node, a WSRF-compliant Web service is used to expose all the data mining algorithms provided by the Weka library. The paper describes the design and implementation of Weka4WS using the WSRF libraries and services provided by Globus Toolkit 4. A performance analysis of Weka4WS for executing distributed data mining tasks in different network scenarios is presented. Copyright © 2008 John Wiley & Sons, Ltd.

14.
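Based on an analysis of the shortcomings of traditional distributed data mining platforms and drawing on the idea of grid services, this paper proposes a grid-service-based distributed data mining platform and implements on it a distributed BP network classification algorithm (GBPC-GS). Simulation experiments show that, compared with a single-machine environment, the average running time of the algorithm drops markedly as the number of grid nodes increases, and CPU load also falls by about 40%.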

15.
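Research on a service-oriented cloud data mining engine   Total citations: 1, self-citations: 0, citations by others: 1
The scalability of data mining algorithms is constrained when they process massive data. Across business and scientific fields, knowledge discovery processes and requirements differ widely, so effective mechanisms are needed to design and run many types of distributed data mining applications. This paper proposes CloudDM, a framework for a service-oriented cloud data mining engine. Unlike grid-based distributed data mining frameworks, CloudDM uses the ability of the open source cloud computing platform Hadoop to process massive data and supports the design and execution of distributed data mining applications in a service-oriented form. The paper also describes the key components and implementation techniques of the engine. A service-oriented software architecture combined with a cloud-based data mining engine can effectively address massive data storage, data processing, and interoperability of data mining algorithms in massive data mining.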

16.
Privacy-preserving data mining has become increasingly popular because it allows sharing of privacy-sensitive data for analysis purposes. However, existing techniques such as random perturbation do not fare well for simple yet widely used and efficient Euclidean distance-based mining algorithms. Although original data distributions can be reconstructed fairly accurately from the perturbed data, distances between individual data points are not preserved, leading to poor accuracy for distance-based mining methods. Besides, they do not generally focus on data reduction. Other studies on secure multi-party computation often concentrate on techniques useful to very specific mining algorithms and scenarios, such that they require modification of the mining algorithms and are often difficult to generalize to other mining algorithms or scenarios. This paper proposes a novel generalized approach that uses the well-known energy compaction power of Fourier-related transforms to hide sensitive data values and to approximately preserve Euclidean distances in centralized and distributed scenarios to a great degree of accuracy. Three algorithms to select the most important transform coefficients are presented: one for a centralized database, one for a horizontally partitioned database, and one for a vertically partitioned database. Experimental results demonstrate the effectiveness of the proposed approach.
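The property this abstract relies on can be checked numerically: an orthonormal Fourier-related transform preserves Euclidean distance exactly, and for smooth data a few coefficients carry most of it. The sketch below uses an orthonormal DCT-II as the Fourier-related transform and simply keeps the first `m` coefficients; both choices are assumptions for illustration, since the abstract does not say which transform is used or how its three selection algorithms work. The names `dct_ortho`, `distance`, and `truncated_distance` are hypothetical.

```python
import math


def dct_ortho(x):
    """Orthonormal DCT-II of a sequence, computed directly in O(n^2)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def distance(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def truncated_distance(a, b, m):
    """Distance measured on only the first m DCT coefficients of each record."""
    return distance(dct_ortho(a)[:m], dct_ortho(b)[:m])
```

Measured on all coefficients, the distance equals the original one up to rounding. Measured on a subset, it can only underestimate the true distance, because it sums a subset of the same squared terms; the discarded coefficients are what hides the original values.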

17.
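Table semantic summarization uses data generalization to summarize a data table: over the concept hierarchies of the attribute values of multidimensional attributes, it constructs a small number of tuples with abstract semantics to replace a large number of original tuples with detailed semantics. Given an original data table and the concept hierarchies of its multidimensional attribute values, the goal is to produce a summary table that achieves a specified compression ratio while retaining as much semantic information as possible. Existing algorithms search for the best combination of generalized tuples within the set of generalized tuples, which converts the task into a Set-Covering problem. Although they adopt several optimizations (such as preprocessing and staged processing) to improve efficiency, they still suffer from high conversion overhead, a complex algorithmic framework, and difficulty scaling to high-dimensional attributes. By defining a metric space over the multidimensional attribute hierarchy, this paper converts the problem into a clustering problem in a multidimensional hierarchical space and introduces Dewey encoding to make the conversion more efficient. It proposes two clustering algorithms, one based on fast-converging hierarchical agglomeration and one based on adjusting the resolution of the hierarchical space, to build semantic summary tables efficiently. Experiments on real datasets show that the new algorithms outperform existing methods in both execution efficiency and summary quality.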

18.
Nowadays, high volumes of massive data can be generated from various sources (e.g., sensor data from environmental surveillance). Many existing distributed frequent itemset mining algorithms do not allow users to express the itemsets to be mined according to their intention via the use of constraints. Consequently, these unconstrained mining algorithms can yield numerous itemsets that are not interesting to users. Moreover, due to inherent measurement inaccuracies and/or network latencies, the data are often riddled with uncertainty. These call for both constrained mining and uncertain data mining. In this journal article, we propose a data-intensive computer system for tree-based mining of frequent itemsets that satisfy user-defined constraints from a distributed environment such as a wireless sensor network of uncertain data.
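Two ideas from this abstract can be made concrete: on uncertain data, support is usually replaced by expected support, and a user constraint filters which itemsets are reported. The sketch below assumes the common model in which each item has an independent existence probability per transaction, and it enumerates candidates by brute force rather than with the tree structure the article proposes; `expected_support`, `mine_constrained`, and the sample data are hypothetical.

```python
from itertools import combinations


def expected_support(transactions, itemset):
    """Sum over transactions of the product of the items' existence probabilities."""
    total = 0.0
    for t in transactions:
        p = 1.0
        for item in itemset:
            p *= t.get(item, 0.0)  # an absent item has probability 0
        total += p
    return total


def mine_constrained(transactions, min_esup, constraint, max_size=2):
    """Brute-force search: keep itemsets that pass the constraint and the threshold."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            if not constraint(combo):
                continue
            esup = expected_support(transactions, combo)
            if esup >= min_esup:
                result[combo] = esup
    return result
```

Each transaction here is a dict from item to probability, for example `{"a": 0.9, "b": 0.5}`, and the constraint is any predicate on an item tuple, such as `lambda c: "a" in c`.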

19.
Existing parallel algorithms for association rule mining have a large inter-site communication cost or require a large amount of space to maintain the local support counts of a large number of candidate sets. This study proposes a de-clustering approach for distributed architectures, which eliminates the inter-site communication cost, for most of the influential association rule mining algorithms. To de-cluster the database into similar partitions, an efficient algorithm is developed to approximate the shortest spanning path (SSP) to link transaction data together. The SSP obtained is then used to evenly de-cluster the transaction data into subgroups. The proposed approach guarantees that all subgroups are similar to each other and to the original group. Experimental results show that data size and the number of items are the only two factors that determine the performance of de-clustering. Additionally, based on the approach, most of the influential association rule mining algorithms can be implemented in a distributed architecture to obtain a drastic increase in speed without losing any frequent itemsets. Furthermore, the data distribution in each de-clustered participant is almost the same as that of a single site, which implies that the proposed approach can be regarded as a sampling method for distributed association rule mining. Finally, the experimental results show that the originally inadequate mining results can be improved to an almost perfect level.

20.
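A global closed frequent itemset mining algorithm based on semi-integration by superposition of Iceberg concept lattices   Total citations: 2, self-citations: 0, citations by others: 2
Developing dedicated distributed data mining algorithms is an effective way to improve data analysis and mining over distributed databases. This paper combines two properties of the Iceberg concept lattice, its concise representation of frequent itemsets and the parallelizability of its integration-based construction, to mine global closed frequent itemsets in a distributed setting. Because the literature still lacks studies on the distributed, integration-based construction of Iceberg concept lattices, the paper first analyzes theoretically the limitations of building a global Iceberg concept lattice by superposition-based integration, and then demonstrates the feasibility of building it by superposition-based semi-integration. It further proposes Frecogd, a distributed frequent concept growth algorithm based on superposition-based semi-integration of Iceberg concept lattices, and applies it to mining global closed frequent itemsets in a homogeneous distributed environment. Experiments confirm the theoretical feasibility of the algorithm and also reveal that its mining performance needs further improvement.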
