首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Parallel Algorithms for Discovery of Association Rules   总被引:2,自引:0,他引:2  
Discovery of association rules is an important data mining task. Several parallel and sequential algorithms have been proposed in the literature to solve this problem. Almost all of these algorithms make repeated passes over the database to determine the set of frequent itemsets (a subset of database items), thus incurring high I/O overhead. In the parallel case, most algorithms perform a sum-reduction at the end of each pass to construct the global counts, also incurring high synchronization cost. In this paper we describe new parallel association mining algorithms. The algorithms use novel itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. Once this set has been identified, the algorithms make use of efficient traversal techniques to generate the frequent itemsets contained in each cluster. We propose two clustering schemes based on equivalence classes and maximal hypergraph cliques, and study two lattice traversal techniques based on bottom-up and hybrid search. We use a vertical database layout to cluster related transactions together. The database is also selectively replicated so that the portion of the database needed for the computation of associations is local to each processor. After the initial set-up phase, the algorithms do not need any further communication or synchronization. The algorithms minimize I/O overheads by scanning the local database portion only twice. Once in the set-up phase, and once when processing the itemset clusters. Unlike previous parallel approaches, the algorithms use simple intersection operations to compute frequent itemsets and do not have to maintain or search complex hash structures. Our experimental testbed is a 32-processor DEC Alpha cluster inter-connected by the Memory Channel network. We present results on the performance of our algorithms on various databases, and compare it against a well known parallel algorithm. The best new algorithm outperforms it by an order of magnitude.  相似文献   

2.
MAFIA: a maximal frequent itemset algorithm   总被引:4,自引:0,他引:4  
We present a new algorithm for mining maximal frequent itemsets from a transactional database. The search strategy of the algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms that significantly improve mining performance. Our implementation for support counting combines a vertical bitmap representation of the data with an efficient bitmap compression scheme. In a thorough experimental analysis, we isolate the effects of individual components of MAFIA including search space pruning techniques and adaptive compression. We also compare our performance with previous work by running tests on very different types of data sets. Our experiments show that MAFIA performs best when mining long itemsets and outperforms other algorithms on dense data by a factor of three to 30.  相似文献   

3.
频繁项集挖掘的研究与进展   总被引:6,自引:0,他引:6  
挖掘频繁项集是许多数据挖掘任务中的关键问题,也是关联规则挖掘算法的核心,所以提高频繁项集的生成效率一直是近几年数据挖掘领域研究的热点之一,研究人员从不同的角度对算法进行改进以提高算法的效率。该文从频繁项集生成过程中解空间的类型、搜索方法和剪枝策略、数据库的表示方法、数据压缩技术等几个方面对频繁项集挖掘的基本策略进行了研究,对完全频繁项集挖掘、频繁闭项集挖掘和最大频繁项集挖掘的典型算法特别是最新算法进行了介绍和评述,并分析了各种算法的性能特点,指出其适于哪种类型的数据集。最后,对频繁项集挖掘算法的发展方向进行了初步的探讨。  相似文献   

4.
频繁项集挖掘算法综述   总被引:4,自引:0,他引:4  
该文基于频繁项集挖掘算法的研究现状,采用自底向上遍历搜索、自顶向下遍历搜索和混合遍历搜索的分类方法,对现有的频繁项集挖掘算法进行归纳分类,分析和比较了各类别中具有代表性的挖掘算法,总结每种算法各方面的特性.同时,对一些特殊的频繁项集挖掘算法也作了简单介绍.旨在使读者全面掌握频繁项集挖掘算法目前的研究水平,便于研究者对已有的算法进行改进,提出具有更好性能的新的分类算法,也便于使用者在应用时对算法的选择和使用.  相似文献   

5.
A survey on algorithms for mining frequent itemsets over data streams   总被引:9,自引:8,他引:1  
The increasing prominence of data streams arising in a wide range of advanced applications such as fraud detection and trend learning has led to the study of online mining of frequent itemsets (FIs). Unlike mining static databases, mining data streams poses many new challenges. In addition to the one-scan nature, the unbounded memory requirement and the high data arrival rate of data streams, the combinatorial explosion of itemsets exacerbates the mining task. The high complexity of the FI mining problem hinders the application of the stream mining techniques. We recognize that a critical review of existing techniques is needed in order to design and develop efficient mining algorithms and data structures that are able to match the processing rate of the mining with the high arrival rate of data streams. Within a unifying set of notations and terminologies, we describe in this paper the efforts and main techniques for mining data streams and present a comprehensive survey of a number of the state-of-the-art algorithms on mining frequent itemsets over data streams. We classify the stream-mining techniques into two categories based on the window model that they adopt in order to provide insights into how and why the techniques are useful. Then, we further analyze the algorithms according to whether they are exact or approximate and, for approximate approaches, whether they are false-positive or false-negative. We also discuss various interesting issues, including the merits and limitations in existing research and substantive areas for future research.  相似文献   

6.
Fast algorithms for frequent itemset mining using FP-trees   总被引:9,自引:0,他引:9  
Efficient algorithms for mining frequent itemsets are crucial for mining association rules as well as for many other data mining tasks. Methods for mining frequent itemsets have been implemented using a prefix-tree structure, known as an FP-tree, for storing compressed information about frequent itemsets. Numerous experimental results have demonstrated that these algorithms perform extremely well. In this paper, we present a novel FP-array technique that greatly reduces the need to traverse FP-trees, thus obtaining significantly improved performance for FP-tree-based algorithms. Our technique works especially well for sparse data sets. Furthermore, we present new algorithms for mining all, maximal, and closed frequent itemsets. Our algorithms use the FP-tree data structure in combination with the FP-array technique efficiently and incorporate various optimization techniques. We also present experimental results comparing our methods with existing algorithms. The results show that our methods are the fastest for many cases. Even though the algorithms consume much memory when the data sets are sparse, they are still the fastest ones when the minimum support is low. Moreover, they are always among the fastest algorithms and consume less memory than other methods when the data sets are dense.  相似文献   

7.
关联规则挖掘是近年来数据挖掘领域中一个相当活跃的领域,频繁项集挖掘是关联规则挖掘中最重要的任务。最大频繁项集的规模远远小于频繁项集的规模,通过最大频繁项集可以导出所有的频繁项集,因此进行了很多专门挖掘最大频繁项集的研究。给出了关联规则和相关术语的基本概念,对最大频繁项集挖掘算法作了分析与评价,便于研究者对已有的算法进行改进,提出具有更好性能的新算法。  相似文献   

8.
Discovery of frequent DATALOG patterns   总被引:19,自引:0,他引:19  
Discovery of frequent patterns has been studied in a variety of data mining settings. In its simplest form, known from association rule mining, the task is to discover all frequent itemsets, i.e., all combinations of items that are found in a sufficient number of examples. The fundamental task of association rule and frequent set discovery has been extended in various directions, allowing more useful patterns to be discovered with special purpose algorithms. We present WARMR, a general purpose inductive logic programming algorithm that addresses frequent query discovery: a very general DATALOG formulation of the frequent pattern discovery problem.The motivation for this novel approach is twofold. First, exploratory data mining is well supported: WARMR offers the flexibility required to experiment with standard and in particular novel settings not supported by special purpose algorithms. Also, application prototypes based on WARMR can be used as benchmarks in the comparison and evaluation of new special purpose algorithms. Second, the unified representation gives insight to the blurred picture of the frequent pattern discovery domain. Within the DATALOG formulation a number of dimensions appear that relink diverged settings.We demonstrate the frequent query approach and its use on two applications, one in alarm analysis, and one in a chemical toxicology domain.  相似文献   

9.
基于iceberg概念格并置集成的闭频繁项集挖掘算法   总被引:2,自引:0,他引:2  
由于概念格的完备性,在基于概念格的数据挖掘过程中,构造概念格的时间复杂度和空间复杂度一直是影响其应用的主要因素.结合iceberg概念格的半格特性和概念格的集成思想,首先在理论上分析并置集成后的iceberg概念格与由完备概念格裁剪得到的iceberg格同构;然后分析了iceberg概念格集成过程中的映射关系;最终提出一个新颖的基于iceberg概念格并置的闭频繁项集挖掘算法(Icegalamera).此算法避免了完备概念格的计算,并且在构造过程中采用集成和剪枝策略,从而显著提高了挖掘效率.实验证明其产生的闭频繁项集的完备性.使用稠密和稀疏数据集在单站点模式下进行了性能测试,结果表明稀疏数据集上性能优势明显.  相似文献   

10.
Incremental mining has attracted the attention of many researchers due to its usefulness in online applications. Many algorithms have thus been proposed for incrementally mining frequent itemsets. Maintaining a frequent-itemset lattice (FIL) is difficult for databases with large numbers of frequent itemsets, especially huge databases, due to the storage of links of nodes in the lattice. However, generating association rules from a FIL has been shown to be more effective than traditional methods such as directly generating rules from frequent itemsets or frequent closed itemsets. Therefore, when the number of frequent itemsets is not huge (i.e., they can be stored in the lattice without excessive memory overhead), the lattice-based approach outperforms approaches which mine association rules from frequent itemsets/frequent closed itemsets. However, incremental algorithms for building FILs have not yet been proposed. This paper proposes an effective approach for the maintenance of a FIL based on the pre-large concept in incremental mining. The building process of a FIL is first improved using two proposed theorems regarding the paternity relation between two nodes in the lattice. An effective approach for maintaining a FIL with dynamically inserted data is then proposed based on the pre-large and the diffset concepts. The experimental results show that the proposed approach outperforms the batch approach for building a FIL in terms of execution time.  相似文献   

11.
Frequent itemset mining is an important problem in the data mining area with a wide range of applications. Many decision support systems need to support online interactive frequent itemset mining, which is a challenging task because frequent itemset mining is a computation intensive repetitive process. One solution is to precompute frequent itemsets. In this paper, we propose a compact disk-based data structure—CFP-tree to store precomputed frequent itemsets on a disk to support online mining requests. The CFP-tree structure effectively utilizes the redundancy in frequent itemsets to save space. The compressing ratio of a CFP-tree can be as high as several thousands or even higher. Efficient algorithms for retrieving frequent itemsets from a CFP-tree, as well as efficient algorithms to construct and maintain a CFP-tree, are developed. Our performance study demonstrates that with a CFP-tree, frequent itemset mining requests can be responded to promptly.  相似文献   

12.
基于FP树的全局最大频繁项集挖掘算法   总被引:12,自引:1,他引:12  
挖掘最大频繁项集是多种数据挖掘应用了更新最大频繁候选项集集合,需要反复地扫描整个数据库,而且大部分算法是单机算法,全局最大频繁项集挖掘算法并不多见.为此提出MGMF算法,该算法利用FP-树结构,类似FP-树挖掘方法,一遍就可以挖掘出所有的最大频繁项集,并且超集检测非常简单、快捷.另外MGMF算法采用了分布式PDDM算法播报消息的思想,具有很好的拓展性和并行性.实验证明MGMF算法是有效可行的.  相似文献   

13.
最大频繁项集挖掘算法存在扫描数据集次数多和候选集规模过大等局限。基于Iceberg概念格模型,提出一种在Iceberg概念格上挖掘最大频繁项集的算法ICMFIA。该算法通过一次扫描数据集构建Iceberg概念格,利用Iceberg概念格中频繁概念之间良好的覆盖关系能快速计算出最大频繁项集所对应的最大频繁概念,所有最大频繁概念的内涵就是所求的最大频繁项集的集合。实验结果表明,该算法具有扫描数据集次数少和挖掘效率高的优点。  相似文献   

14.
基于Iceberg概念格叠置半集成的全局闭频繁项集挖掘算法   总被引:2,自引:0,他引:2  
研究专有的分布式数据挖掘算法是提高分布式数据库下数据分析和挖掘的有效方法.结合Iceberg概念格对于频繁项集精简表达的特性和其集成构造过程可并行化的特点,进而实现分布式全局闭频繁项集的挖掘.面对目前仍然缺乏有关Iceberg概念格分布式集成构造研究的文献,本文从理论上分析Iceberg概念格叠置集成构造全局Iceberg概念格的局限性,然后论证了基于Iceberg概念格叠置半集成构造全局Iceberg概念格的可行性,进而提出一个基于Iceberg概念格叠置半集成的频繁概念生长分布算法(Frecogd),并且把它应用于同构分布式环境下的全局闭频繁项集挖掘过程中.实验验证了该算法理论的可行性,同时也揭示了该算法的挖掘效能有待进一步的改进与提高.  相似文献   

15.
目前已提出了许多快速的关联规则挖掘算法,实际上用户只关心部分关联规则,如他们仅想知道包含指定项目的规则.当这些约束被用于数据预处理或将它结合到数据挖掘算法中去时,可以显著减少算法的执行时间.为此,考虑了一类包含或不包含某些项目的布尔表达式约束条件,提出了一种快速的基于FP—tree的约束最大频繁项目集挖掘算法CMFIMA,并对其更新问题进行了研究,提出了一种增量式更新约束最大频繁项目集挖掘算法CMFIUA.  相似文献   

16.
《Information Systems》2002,27(1):41-74
Most algorithms for association rule mining are variants of the basic Apriori algorithm (Agarwal and Srikant, Fast algorithms for mining association rules in databases, in: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB’94), Santiago, Chile, 1994, pp. 487–499). One characteristic of these Apriori-based algorithms is that candidate itemsets are generated in rounds, with the size of the itemsets incremented by one per round. The number of database scans required by Apriori-based algorithms thus depends on the size of the biggest frequent itemsets. In this paper, we devise a more general candidate set generation algorithm, LGen, which generates candidate itemsets of multiple sizes during each database scan. We present an algorithm FindLarge which uses LGen to find frequent itemsets. We show that, given a reasonable set of suggested frequent itemsets, FindLarge can significantly reduce the number of I/O passes required. In the best cases, only two passes are sufficient to discover all the frequent itemsets irrespective of the size of the biggest ones.Two I/O-saving algorithms, namely DIC and Pincher-Search, are compared with FindLarge in a series of experiments. We discuss the conditions under which FindLarge significantly outperforms the others in terms of I/O efficiency.  相似文献   

17.
挖掘频繁项集是挖掘数据流的基本任务.许多近似算法能够对数据流进行频繁项集的挖掘,但不能有效控制内存资源消耗和挖掘运行时间.为了提高数据流挖掘的效率,通过挖掘数据流中的频繁闭项集来减少挖掘结果项集的数量,并借鉴Relim算法和Manku算法,引入事务链表组作为概要数据结构,提出了一种新的数据流频繁闭项集的挖掘算法.最后通过实验,证明了该算法的有效性.  相似文献   

18.
Mining frequent itemsets is an essential problem in data mining and plays an important role in many data mining applications. In recent years, some itemset representations based on node sets have been proposed, which have shown to be very efficient for mining frequent itemsets. In this paper, we propose DiffNodeset, a novel and more efficient itemset representation, for mining frequent itemsets. Based on the DiffNodeset structure, we present an efficient algorithm, named dFIN, to mining frequent itemsets. To achieve high efficiency, dFIN finds frequent itemsets using a set-enumeration tree with a hybrid search strategy and directly enumerates frequent itemsets without candidate generation under some case. For evaluating the performance of dFIN, we have conduct extensive experiments to compare it against with existing leading algorithms on a variety of real and synthetic datasets. The experimental results show that dFIN is significantly faster than these leading algorithms.  相似文献   

19.
数据流中基于矩阵的频繁项集挖掘   总被引:3,自引:0,他引:3  
挖掘频繁项集是挖掘数据流的基本任务。许多近似算法能够有效地对数据流进行频繁项挖掘,但不能有效地控制内存资源消耗和挖掘运行时间。为了提高数据流频繁项集挖掘的时空效率,通过引入矩阵作为概要数据结构,提出了一种新的数据流频繁项集挖掘算法。最后通过实验证明了该算法的有效性。  相似文献   

20.
频繁项集挖掘算法是关联规则挖掘问题的关键,是数据挖掘领域的一个研究热点.自从Apriori算法提出至今,学者提出来大量的关于频繁项集挖掘的算法.本文按照挖掘方式将这些算法分成三类,即宽度优先、深度优先、宽度和深度相结合,并对每类算法进行了全面的综述及深入的分析,并给出了以后的研究方向.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号