Similar Documents
20 similar documents found (search time: 171 ms)
1.
Preserving privacy is one of the focal problems for the future of data mining, and obtaining accurate data relationships without sharing exact data is the primary task of privacy-preserving data mining. This paper introduces the basic problems and measures of privacy-preserving data mining in distributed environments, studies an association rule mining algorithm based on the vector dot product, and presents a secure vector dot-product protocol. For vertically partitioned distributed databases, the protocol can both search for frequent itemsets and preserve the privacy of each party's data.
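The core identity behind this entry can be sketched in a few lines. This toy (all names are ours, and the cryptographic masking that makes the protocol *secure* is omitted entirely) shows why a dot product suffices: for vertically partitioned data, the joint support count of an itemset split across two parties equals the dot product of their boolean indicator vectors.

```python
def support_via_dot_product(col_a, col_b):
    """col_a, col_b: per-transaction 0/1 indicators held by parties A and B.
    Their dot product counts the transactions containing both items."""
    return sum(a * b for a, b in zip(col_a, col_b))

# 5 transactions; party A knows who bought item X, party B who bought item Y.
bought_x = [1, 0, 1, 1, 0]
bought_y = [1, 1, 0, 1, 0]
print(support_via_dot_product(bought_x, bought_y))  # → 2
```

A secure version would exchange only masked vectors so that neither party learns the other's column; only the final count is revealed.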

2.
符燕华  顾嗣扬 《计算机应用》2006,26(1):213-215
This paper uses the scalar (dot) product method to mine association rules from vertically partitioned data while preserving privacy. A dot-product algorithm is given and its security analyzed, and an example illustrates how to apply the algorithm to vertically distributed data mining.

3.
A Grid Implementation of an Improved Apriori Mining Algorithm
殷剑锋  徐建城  李伟强 《计算机仿真》2010,27(2):145-148,268
Scientific and commercial applications need to analyze massive data distributed across heterogeneous sites. Traditional association rule mining algorithms mostly target centralized datasets and cannot handle groups of distributed dynamic databases, so research on distributed mining algorithms is urgently needed. Building on the service-oriented OGSA architecture, this paper combines grid technology with data mining and proposes a grid-based distributed association rule mining method, a concrete application of an improved Apriori algorithm in a grid environment. Simulation experiments show that the method exhibits the grid's parallel mining characteristics, successfully performs distributed mining across multiple heterogeneous sites, and substantially improves mining speed and computational efficiency over the centralized Apriori algorithm.
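The count-aggregation step common to distributed Apriori variants like this one can be sketched as follows (function names are illustrative, not the paper's API): each site counts candidate itemsets over its local partition, and a coordinator sums the local counts and keeps candidates meeting the global minimum support.

```python
from collections import Counter
from itertools import combinations

def local_counts(transactions, k):
    """Count all size-k itemsets in one site's local partition."""
    c = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), k):
            c[itemset] += 1
    return c

def global_frequent(site_counts, total_tx, min_support):
    """Coordinator: sum per-site counts, keep globally frequent itemsets."""
    merged = Counter()
    for c in site_counts:
        merged.update(c)
    return {i: n for i, n in merged.items() if n / total_tx >= min_support}

site1 = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
site2 = [{"a", "b"}, {"b", "c"}]
counts = [local_counts(site1, 2), local_counts(site2, 2)]
print(global_frequent(counts, 5, 0.5))  # → {('a', 'b'): 3}
```

A real grid deployment would run `local_counts` as a service on each site and iterate this exchange once per candidate size, as Apriori does.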

4.
Sharing-Based Privacy-Preserving Association Rule Mining
The widespread use of data mining raises new problems of information security and privacy protection. This paper improves on classic algorithms for distributed privacy-preserving association rule mining: without using the currently popular secure multi-party computation (SMC), it performs privacy-preserving association rule mining with a simpler method and a reduced computational load, while still protecting each site's data and information during distributed mining.

5.
Privacy-preserving data mining is an important research problem in data mining. Its goal is to mine accurately when the original plaintext data is unavailable, producing rules and knowledge identical or similar to those mined from the plaintext. To strengthen privacy protection and improve mining accuracy for clustering in distributed environments, this paper combines fully homomorphic encryption and decryption algorithms and proposes and implements FHE-DBIRCH, a distributed privacy-preserving clustering model based on fully homomorphic encryption. Dataset transmission in the model is encrypted and decrypted with a fully homomorphic encryption algorithm, guaranteeing the privacy of the original data. Theoretical analysis and experimental results show that FHE-DBIRCH offers good data privacy while preserving clustering accuracy.

6.
Privacy preservation is a meaningful research direction in data mining. M. Kantarcioglu et al. proposed an algorithm for privacy-preserving association rule mining over horizontally partitioned data. This paper explores how to run mining algorithms over the joint dataset of two vertically partitioned private databases while guaranteeing that neither party reveals any database content unrelated to the result. For association rule mining, which is widely used in classification, a privacy-preserving mining protocol is given based on a secure two-party computation protocol.

7.
A Privacy-Preserving K-Means Clustering Algorithm for Distributed Data
Obtaining accurate data relationships without leaking any private data of the collaborating parties is the primary task of privacy preservation in distributed data mining. Combining secure multi-party computation with data mining techniques, this paper proposes privacy-preserving k-means clustering algorithms for both horizontally and vertically partitioned data. Experiments show that the algorithms effectively protect data privacy without affecting the clustering results.
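One centroid-update round for horizontally partitioned k-means can be sketched like this (names are illustrative; the secure-sum protocol that would hide each site's partial sums is replaced by plain addition here): each site sends per-cluster coordinate sums and counts, and the coordinator averages them into new centroids without ever seeing individual points.

```python
def local_stats(points, centroids):
    """One site: assign local points to nearest centroid, return
    per-cluster coordinate sums and counts (never the raw points)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2 for d in range(dim)))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def update_centroids(all_stats, centroids):
    """Coordinator: aggregate all sites' sums/counts into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    new = []
    for j in range(k):
        n = sum(c[j] for _, c in all_stats)
        if n == 0:
            new.append(centroids[j])  # keep empty clusters in place
        else:
            new.append([sum(s[j][d] for s, _ in all_stats) / n for d in range(dim)])
    return new

site_a = [[0.0, 0.0], [1.0, 0.0]]
site_b = [[10.0, 10.0], [11.0, 10.0]]
cents = [[0.0, 0.0], [10.0, 10.0]]
stats = [local_stats(site_a, cents), local_stats(site_b, cents)]
print(update_centroids(stats, cents))  # → [[0.5, 0.0], [10.5, 10.0]]
```

In a privacy-preserving variant the coordinator would receive only the *encrypted or masked* sums, so the aggregate is computable but no single site's contribution is.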

8.
A Key-Point-Based Time Series Clustering Algorithm
谢福鼎  李迎  孙岩  张永 《计算机科学》2012,39(3):160-162
Privacy-preserving data mining extracts accurate rules and knowledge without exact access to the original data. For the privacy-protection problem of clustering in distributed environments, this paper proposes FHE-DK-MEANS, a distributed clustering algorithm based on fully homomorphic encryption. Theoretical analysis and experimental results show that FHE-DK-MEANS offers good data privacy while preserving clustering accuracy.

9.
Because of the many advantages of cloud computing, users tend to outsource data mining and data analysis to professional cloud service providers, but user privacy then cannot be guaranteed. Many researchers focus on privacy protection for sensitive data *storage* in the cloud, while research on privacy-preserving data *analysis* is still scarce. Yet if big data is merely protected and never mined or analyzed, it loses its enormous potential value. This paper proposes a lattice-based privacy-preserving data publishing method for cloud environments, using lattice encryption to build secure homomorphic operations on private data, and on that basis implements privacy-preserving clustering services over ciphertext in the cloud. To protect privacy, users encrypt their data before publishing it to the cloud provider, which then uses lattice-based homomorphic encryption to provide privacy-preserving k-means, hierarchical clustering, and DBSCAN mining services without being able to access user data directly. Compared with existing privacy-preserving data publishing methods, the approach rests on the lattice closest vector problem (CVP) and shortest vector problem (SVP) and therefore offers high security. The algorithm also preserves the exactness of distances between ciphertexts, so the mining results are more accurate and usable than those of existing work. The paper analyzes the security of the method theoretically and evaluates its efficiency experimentally; the results show that the lattice-based privacy-preserving mining algorithms achieve higher analysis accuracy and higher computational efficiency than existing methods.

10.
A Survey of Distributed Privacy-Preserving Data Mining
Privacy-preserving mining has been one of the hot topics in data mining in recent years; it studies how to discover latent knowledge in data while avoiding the leakage of sensitive data. In practice, large volumes of data reside at multiple sites, so distributed privacy-preserving data mining (DPPDM) is of particular practical significance. This paper surveys the field in detail, compares the advantages and disadvantages of the various methods, classifies and summarizes existing approaches, and finally points out future research directions.

11.
Mining frequent itemsets over data streams has attracted much research attention in recent years. In the past, we developed a hash-based approach for mining frequent itemsets over a single data stream. In this paper, we extend that approach to mine global frequent itemsets from a collection of data streams distributed at distinct remote sites. To speed up the mining process, we make the first attempt to address the new problem of continuously maintaining a global synopsis for the union of all the distributed streams. The mining results can therefore be yielded on demand by directly processing the maintained global synopsis. Instead of collecting and processing all the data in a central server, which may waste the computational resources of remote sites, distributed computations over the data streams are performed. A distributed computation framework is proposed in this paper, including two communication strategies and one merging operation. The communication strategies are designed according to an accuracy guarantee on the mining results, determining when and what the remote sites should transmit to the central server (named the coordinator). The merging operation, in turn, folds the information received from the remote sites into the global synopsis maintained at the coordinator. Together, the strategies and the operation achieve the goal of continuously maintaining the global synopsis. Rooted in this synopsis, we propose a mining algorithm for finding global frequent itemsets. Moreover, we provide correctness guarantees for the communication strategies and merging operation, and an accuracy-guarantee analysis of the mining algorithm. Finally, a series of experiments on synthetic datasets and a real dataset shows the effectiveness and efficiency of the distributed computation framework.
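The interplay of a communication strategy and the merging operation described above can be sketched as follows (a toy under our own naming, not the paper's framework): a site reports its untransmitted counts only when some count has grown large enough, relative to a tolerance `epsilon`, to matter for the global answer, and the coordinator folds the report into its synopsis.

```python
from collections import Counter

def should_transmit(local_delta, local_n, epsilon):
    """Communication-strategy sketch: report only when some untransmitted
    itemset count exceeds epsilon times the local stream length."""
    return any(v > epsilon * local_n for v in local_delta.values())

def merge_into_synopsis(global_synopsis, site_delta):
    """Merging operation: fold a site's reported counts into the
    coordinator's global synopsis (elementwise Counter addition)."""
    global_synopsis.update(site_delta)
    return global_synopsis

synopsis = Counter({("a",): 10, ("a", "b"): 4})   # coordinator's state
delta = Counter({("a",): 3, ("b",): 2})           # one site's unreported counts
if should_transmit(delta, 20, 0.1):               # 3 > 0.1 * 20, so it reports
    merge_into_synopsis(synopsis, delta)
print(synopsis[("a",)])  # → 13
```

The actual framework derives the transmission threshold from the accuracy guarantee, so deferring a report can never push the global answer outside the promised error bound.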

12.
Heterogeneous Distributed Data Mining Based on the SPRINT Classification Algorithm
Classification is one of the most important techniques in data mining. With the rapid growth of networks and the increasing prevalence of distributed environments, distributed data mining has become a hot topic in recent years. Since most current databases are heterogeneously distributed, this paper proposes using the SPRINT algorithm for classification research in distributed environments. It briefly introduces SPRINT and then, through a concrete example, discusses in detail the stages of per-site preprocessing, computing the best split, and decision tree generation at the central site, together with the concrete algorithm design and implementation.

13.
Addressing the autonomous, heterogeneous, and private character of data in distributed environments, this paper proposes decomposing existing data mining algorithms into two parts: distributed statistics gathering and model generation. Taking decision trees as the object of study, it analyzes the distributed information requirements and designs the steps of a distributed mining algorithm. Performance analysis shows the algorithm outperforms centralized algorithms in data autonomy and communication cost.

14.
Efficient monitoring of skyline queries over distributed data streams
Data management and data mining over distributed data streams have recently received considerable attention within the database community. This paper is the first work to address skyline queries over distributed data streams, where the streams derive from multiple horizontally split data sources. A skyline query returns the set of interesting objects that are not dominated by any other object in the base dataset. Previous work concentrated on skyline computation over static data or centralized data streams. We present an efficient and effective algorithm called BOCS to handle this issue in the more challenging environment of distributed streams. BOCS consists of an efficient centralized algorithm, GridSky, and an associated communication protocol. Based on a strategy of progressive refinement, BOCS computes the skyline incrementally in two phases. In the first phase, local skylines on the remote sites are maintained by GridSky, and at each time step only the skyline increments are sent to the coordinator. In the second phase, the global skyline is obtained by integrating the remote increments with the latest global skyline. A theoretical analysis shows that BOCS is communication-optimal among all algorithms that use a share-nothing strategy. Extensive experiments demonstrate that our proposals are efficient, scalable, and stable.
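The two-phase idea behind BOCS rests on a simple property that can be shown in a few lines (a simplified sketch with our own function names, minimizing in every dimension): a globally non-dominated point must also be locally non-dominated, so the global skyline equals the skyline of the union of the local skylines.

```python
def dominates(p, q):
    """p dominates q if p is no worse in every dimension
    and strictly better in at least one (minimization)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

site1 = [(1, 5), (3, 3), (4, 4)]
site2 = [(2, 2), (5, 1)]
local = skyline(site1) + skyline(site2)   # phase 1: local skylines
print(sorted(skyline(local)))             # phase 2: → [(1, 5), (2, 2), (5, 1)]
```

The incremental part of BOCS exploits this property over time: sites ship only *changes* to their local skylines, and the coordinator re-derives the global skyline from those increments rather than from raw points.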

15.
Different from traditional association-rule mining, a new paradigm called Ratio Rule (RR) was proposed recently. Ratio rules aim to capture quantitative association knowledge. We extend this framework to mining ratio rules from distributed and dynamic data sources, a novel and challenging problem. The traditional technique for ratio rule mining is eigen-system analysis, which can often fall victim to noise; this has greatly limited the application of ratio rule mining. Distributed data sources impose additional constraints, since the mining procedure must be robust in the presence of noise: it is difficult to clean all the data sources in real time in real-world tasks. In addition, traditional batch methods for ratio rule mining cannot cope with dynamic data. In this paper, we propose an integrated method for mining ratio rules from distributed and changing data sources: we first mine ratio rules from each data source separately through a novel robust and adaptive one-pass algorithm, Robust and Adaptive Ratio Rule (RARR), and then integrate the rules from each source in a simple probabilistic model. In this way, we acquire the global rules from all the local information sources adaptively. We show that RARR converges to a fixed point and is robust as well, and that the rule integration is efficient and effective. Both theoretical analysis and experiments illustrate that the performance of RARR and the proposed information-integration procedure is satisfactory for discovering latent associations in distributed, dynamic data sources.

16.
Distributed data mining applies techniques that mine distributed data sources while avoiding the need to first collect the data at a central site. This has significant appeal when communication cost and privacy restrict traditional centralized methods. Although there has been development on many fronts in distributed data mining, we still lack models that abstract the process by showing similarities and contrasts between the different methods. In this paper, we introduce two abstract models for distributed clustering in peer-to-peer environments with different goals. The first is the Locally optimized Distributed Clustering (LDC) model, which aims at achieving better local clusters at each node and is facilitated by collaboration through sharing of summarized cluster information. The second is the Globally optimized Distributed Clustering (GDC) model, which aims at one global clustering solution approximating centralized clustering. We also report on concrete realizations of the two models that show their benefits, through application in text mining. The LDC model is realized through the Collaborative P2P Clustering algorithm, while the GDC model is realized through the Hierarchically distributed P2P Clustering algorithm. In the former, we show that peer collaboration yields a significant increase in local clustering quality; the process uses cluster summarization to exchange information between peers. In the latter, we target scalability by structuring the P2P network hierarchically and devise a distributed variant of the k-means algorithm to compute one set of clusters across the hierarchy. We demonstrate through experimental results the effectiveness of both methods and make recommendations on when to use each.

17.
Parallel and distributed methods for incremental frequent itemset mining
Traditional methods for data mining typically assume that the data is centralized, memory-resident, and static. This assumption is no longer tenable. Such methods waste computational and input/output (I/O) resources when data is dynamic, and they impose excessive communication overhead when data is distributed. Efficient implementation of incremental data mining methods is thus becoming crucial for ensuring system scalability and facilitating knowledge discovery when data is dynamic and distributed. In this paper, we address this issue in the context of the important task of frequent itemset mining. We first present an efficient algorithm that dynamically maintains the required information in the presence of data updates without examining the entire dataset. We then show how to parallelize this incremental algorithm. We also propose a distributed asynchronous algorithm that imposes minimal communication overhead for mining distributed dynamic datasets. Our distributed approach can generate local models (in which each site has a summary of its own database) as well as the global model of frequent itemsets (in which all sites have a summary of the entire database). This ability permits our approach not only to generate frequent itemsets, but also to generate high-contrast frequent itemsets, which allows one to examine how the data is skewed across different sites.

18.
A distributed data mining algorithm to improve the detection accuracy when classifying malicious or unauthorized network activity is presented. The algorithm is based on genetic programming (GP) extended with the ensemble paradigm. GP ensembles are particularly suitable for distributed intrusion detection because they allow building a network profile by combining different classifiers that together provide complementary information. The main novelty of the algorithm is that data is distributed across multiple autonomous sites and the learner component acquires useful knowledge from this data in a cooperative way. The network profile is then used to predict abnormal behavior. Experiments on the KDD Cup 1999 data show the capability of genetic programming to deal successfully with the problem of intrusion detection on distributed data.

19.
Frequent Itemset Mining (FIM) is one of the most important data mining tasks and the foundation of many others. In the Big Data era, centralized FIM algorithms cannot meet the time and space demands of mining big data, so Distributed Frequent Itemset Mining (DFIM) algorithms have been designed to meet these challenges. This paper discusses LocalGlobal and RedistributionMining, the two main paradigms of DFIM algorithms, and proposes two algorithms of these paradigms, named LG and RM, on MapReduce, a popular distributed computing model; related work is also discussed. The experimental results show that the RM algorithm has better performance in terms of computation and site scalability, and can serve as a basis for designing MapReduce-based DFIM algorithms. The paper also discusses the main ideas for improving DFIM algorithms based on MapReduce.
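The MapReduce shape of such DFIM algorithms can be sketched word-count-style (this toy and its names are ours, not the paper's LG/RM implementations): mappers emit `(itemset, 1)` pairs from their data partition, the shuffle groups pairs by key, and reducers sum the counts and filter by minimum support.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(partition, k):
    """Mapper: emit (size-k itemset, 1) for every transaction in a partition."""
    for t in partition:
        for itemset in combinations(sorted(t), k):
            yield itemset, 1

def reduce_phase(pairs, min_count):
    """Shuffle + reducer: group by itemset, sum, keep frequent ones."""
    groups = defaultdict(int)
    for key, v in pairs:
        groups[key] += v
    return {key: n for key, n in groups.items() if n >= min_count}

partitions = [[{"a", "b"}, {"a", "c"}],
              [{"a", "b"}, {"a", "b", "c"}]]
pairs = [kv for p in partitions for kv in map_phase(p, 2)]
print(reduce_phase(pairs, 2))  # → {('a', 'b'): 3, ('a', 'c'): 2}
```

A Redistribution-style algorithm differs mainly in *where* the counting happens: candidate itemsets are repartitioned across sites so each reducer owns a disjoint slice of the search space, rather than each site counting everything locally.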

20.
A Global Closed Frequent Itemset Mining Algorithm Based on the Distribution of Direct Products of Frequent Concepts
柴玉梅  张卓  王黎明 《计算机学报》2012,35(5):990-1001
Centralized concept-lattice-based mining algorithms cannot fully exploit distributed computing resources to improve the efficiency of lattice construction, which limits mining performance. This paper further analyzes the inherent parallelism in the apposition assembly of Iceberg concept lattices. Taking the direct product of frequent concepts and its lower cover as the minimum granularity, it decomposes the apposition assembly process for distributed computation. After a theoretical proof of correctness, a novel algorithm for global closed frequent itemset mining in heterogeneous distributed environments is proposed. The algorithm exploits the semilattice structure and apposition-assembly properties of Iceberg concept lattices to take full advantage of the computing resources of distributed environments. Experiments show that the algorithm performs well on both dense and sparse datasets.
