首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Mining association rules is an important task for knowledge discovery. We can analyze past transaction data to discover customer behaviors such that the quality of business decisions can be improved. Various types of association rules may exist in a large database of customer transactions. The strategy of mining association rules focuses on discovering large item sets, which are groups of items which appear together in a sufficient number of transactions. We propose a graph-based approach to generate various types of association rules from a large database of customer transactions. This approach scans the database once to construct an association graph and then traverses the graph to generate all large item sets. Empirical evaluations show that our algorithms outperform other algorithms which need to make multiple passes over the database  相似文献   

2.
Mining sequential patterns means to discover sequential purchasing behaviors of most customers from a large number of customer transactions. Past transaction data can be analyzed to discover customer purchasing behaviors such that the quality of business decisions can be improved. However, the size of the transaction database can be very large. It is very time consuming to find all the sequential patterns from a large database, and users may be only interested in some sequential patterns. Moreover, the criteria of the discovered sequential patterns for user requirements may not be the same. Many uninteresting sequential patterns for user requirements can be generated when traditional mining methods are applied. Hence, a data mining language needs to be provided such that users can query only knowledge of interest to them from a large database of customer transactions. In this article, a data mining language is presented. From the data mining language, users can specify the items of interest and the criteria of the sequential patterns to be discovered. Also, an efficient data mining technique is proposed to extract the sequential patterns according to the users' requests. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 73–87, 2005.  相似文献   

3.
《Information Systems》2001,26(1):1-14
In this paper, we examine the two issues of mining association rules and mining sequential patterns in a large database of sales transactions. The problems of mining association rules and mining sequential patterns focus on discovering large itemsets and large sequences, respectively. We present PSI and PSI_seq for efficient large itemsets generation and large sequences generation, respectively. The main ideas of these two algorithms are using prestored information to minimize the numbers of candidate itemsets and candidate sequences counted in each database scan. The prestored informations for PSI and PSI_seq include the itemsets and the sequences along with their support counts found in the last mining, respectively. Typically a user may require to tune the value of the minimum support many times before a set of useful association rules can be obtained from the transaction database. Using prestored information, the total computation time will be reduced effectively. Empirical results show that our approaches outperform previous methods by an order of magnitude, using little storage space for the prestored information.  相似文献   

4.
Mining useful information and helpful knowledge from large databases has evolved into an important research area in recent years. Among the classes of knowledge derived, finding sequential patterns in temporal transaction databases is very important since it can help model customer behavior. In the past, researchers usually assumed databases were static to simplify data-mining problems. In real-world applications, new transactions may be added into databases frequently. Designing an efficient and effective mining algorithm that can maintain sequential patterns as a database grows is thus important. In this paper, we propose a novel incremental mining algorithm for maintaining sequential patterns based on the concept of pre-large sequences to reduce the need for rescanning original databases. Pre-large sequences are defined by a lower support threshold and an upper support threshold that act as gaps to avoid the movements of sequences directly from large to small and vice versa. The proposed algorithm does not require rescanning original databases until the accumulative amount of newly added customer sequences exceeds a safety bound, which depends on database size. Thus, as databases grow larger, the numbers of new transactions allowed before database rescanning is required also grow. The proposed approach thus becomes increasingly efficient as databases grow.  相似文献   

5.
In this paper, we propose a novel algorithm for mining frequent sequences from transaction databases. The transactions of the same customers form a set of customer sequences. A sequence (an ordered list of itemsets) is frequent if the number of customer sequences containing it satisfies the user-specified threshold. The 1-sequence is a special type of sequences because it consists of only a single itemset instead of an ordered list, while the k-sequence is a sequence composed of k itemsets. Compared with the cost of mining frequent k-sequences (k ≥ 2), the cost of mining frequent 1-sequences is negligible. We adopt a two-phase architecture to find the two types of frequent sequences separately in order that the discovery of frequent k-sequences can be well designed and optimized. For efficient frequent k-sequence mining, every frequent 1-sequence is encoded as a unique symbol and the database is transformed into one constituted by the symbols. We find that it is unnecessary to encode all the frequent 1-seqences, and make full use of the discovered frequent 1-sequences to transform the database into one with a smaller size. For every k ≥ 2, the customer sequences in the transformed database are scanned to find all the frequent k-sequences. We devise the compact representation for a customer sequence and elaborate the method to enumerate all distinct subsequences from a customer sequence without redundant scans. The soundness of the proposed approach is verified and a number of experiments are performed. The results show that our approach outperforms the previous works in both scalability and execution time.  相似文献   

6.
Traditional frequent pattern mining methods consider an equal profit/weight for all items and only binary occurrences (0/1) of the items in transactions. High utility pattern mining becomes a very important research issue in data mining by considering the non-binary frequency values of items in transactions and different profit values for each item. However, most of the existing high utility pattern mining algorithms suffer in the level-wise candidate generation-and-test problem and generate too many candidate patterns. Moreover, they need several database scans which are directly dependent on the maximum candidate length. In this paper, we present a novel tree-based candidate pruning technique, called HUC-Prune (High Utility Candidates Prune), to solve these problems. Our technique uses a novel tree structure, called HUC-tree (High Utility Candidates tree), to capture important utility information of the candidate patterns. HUC-Prune avoids the level-wise candidate generation process by adopting a pattern growth approach. In contrast to the existing algorithms, its number of database scans is completely independent of the maximum candidate length. Extensive experimental results show that our algorithm is very efficient for high utility pattern mining and it outperforms the existing algorithms.  相似文献   

7.
Inter-sequence pattern mining can find associations across several sequences in a sequence database, which can discover both a sequential pattern within a transaction and sequential patterns across several different transactions. However, inter-sequence pattern mining algorithms usually generate a large number of recurrent frequent patterns. We have observed mining closed inter-sequence patterns instead of frequent ones can lead to a more compact yet complete result set. Therefore, in this paper, we propose a model of closed inter-sequence pattern mining and an efficient algorithm called CISP-Miner for mining such patterns, which enumerates closed inter-sequence patterns recursively along a search tree in a depth-first search manner. In addition, several effective pruning strategies and closure checking schemes are designed to reduce the search space and thus accelerate the algorithm. Our experiment results demonstrate that the proposed CISP-Miner algorithm is very efficient and outperforms a compared EISP-Miner algorithm in most cases.  相似文献   

8.
Advanced personalized e-applications require comprehensive knowledge about their users’ likes and dislikes in order to provide individual product recommendations, personal customer advice, and custom-tailored product offers. In our approach we model such preferences as strict partial orders with “A is better than B” semantics, which has been proven to be very suitable in various e-applications. In this paper we present preference mining techniques for detecting strict partial order preferences in user log data. Real-life e-applications like online shops or financial services usually have large log data sets containing the transactions of their customers. Since the preference miner uses sophisticated SQL operations to execute all data intensive operations on database layer, our algorithms scale well even for such large log data sets. With preference mining personalized e-applications can gain valuable knowledge about their customers’ preferences, which can be applied for personalized product recommendations, individual customer service, or one-to-one marketing.  相似文献   

9.
Mining sequential patterns by pattern-growth: the PrefixSpan approach   总被引:12,自引:0,他引:12  
Sequential pattern mining is an important data mining problem with broad applications. However, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [R. Agrawal et al. (1994)] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [J. Han et al. (2000)], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [M. Zaki, (2001)] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.  相似文献   

10.
Mining sequential patterns from large databases has been recognized by many researchers as an attractive task of data mining and knowledge discovery.Previous algorithms scan the databases for many times,which is often unendurable due to the very large amount of databases.In this paper,the authors introduce an effective algorithm for mining sequential patterns from large databases.In the algorithm,the original database is not used at all for counting the support of sequences after the first pass.Rather,a tidlist structure generated in the previous pass is employed for the purpose based on set intersection operations,avoiding the multiple scans of the databases.  相似文献   

11.
Mining association rules and mining sequential patterns both are to discover customer purchasing behaviors from a transaction database, such that the quality of business decision can be improved. However, the size of the transaction database can be very large. It is very time consuming to find all the association rules and sequential patterns from a large database, and users may be only interested in some information.

Moreover, the criteria of the discovered association rules and sequential patterns for the user requirements may not be the same. Many uninteresting information for the user requirements can be generated when traditional mining methods are applied. Hence, a data mining language needs to be provided such that users can query only interesting knowledge to them from a large database of customer transactions. In this paper, a data mining language is presented. From the data mining language, users can specify the interested items and the criteria of the association rules or sequential patterns to be discovered. Also, the efficient data mining techniques are proposed to extract the association rules and the sequential patterns according to the user requirements.  相似文献   


12.
一种基于已存信息的序列模式挖掘更新方法   总被引:2,自引:0,他引:2  
在挖掘序列模式过程中,用户需要多次调整(增加或减少)最小支持度,才能从事务数据库中获得有趣序列模式。文章给出了一个利用已存信息有效产生大序列的PSI-seq算法,它能显著地减少每次扫描数据库时候选序列的计算,从而,提高挖掘的效率。  相似文献   

13.
In the present scenario of global economy and World Wide Web, large sets of evolving and distributed data can be handled efficiently by incremental data mining. Frequent patterns are very important in knowledge discovery and data mining process, such as mining of association rules, correlations. FP-tree is a very versatile data structure used for mining of frequent patterns in knowledge discovery and data mining process. FP-tree is a compact representation of transaction database that contains frequency information of all relevant frequent patterns (FP) of the database. All of the existing incremental frequent pattern mining algorithms, such as AFPIM, CATS, CanTree, CP-tree, and SPO-tree, perform incremental mining by processing one transaction of the incremental part of database at a time and updating it to the FP-tree of initial (original) database. Here, in this paper, we propose a novel method that takes advantage of FP-tree representation of incremental transaction database for incremental mining. We propose a batch incremental processing algorithm BIT_FPGrowth that restructures and merges two small consecutive duration FP-trees to obtain a FP-tree of the FP-Growth algorithm. Our BIT_FPGrowth uses FP-tree as preprocessed data repository to get transactions (i.e., item-sets), unlike other sequential incremental algorithms that read transactions from database. BIT_FPGrowth algorithm takes less time for constructing FP-tree. Our experimental results show that, as the size of the database increases, increase in runtime of BIT_FPGrowth is much less and is least of all the other algorithms.  相似文献   

14.
In this paper, we study the issues of mining and maintaining association rules in a large database of customer transactions. The problem of mining association rules can be mapped into the problems of finding large itemsets which are sets of items brought together in a sufficient number of transactions. We revise a graph-based algorithm to further speed up the process of itemset generation. In addition, we extend our revised algorithm to maintain discovered association rules when incremental or decremental updates are made to the databases. Experimental results show the efficiency of our algorithms. The revised algorithm is a significant improvement over the original one on mining association rules. The algorithms for maintaining association rules are more efficient than re-running the mining algorithms for the whole updated database and outperform previously proposed algorithms that need multiple passes over the database. Received 4 August 1999 / Revised 18 March 2000 / Accepted in revised form 18 October 2000  相似文献   

15.
使用序列模式精简基挖掘序列模式   总被引:3,自引:1,他引:3  
传统的序列模式挖掘方法在挖掘由短的频繁序列模式组成的数据库时有良好的性能.但在挖掘长的序列模式或支持度阈值很低时,这些方法可能遇到固有的困难,因为产生的频繁序列模式的数量经常太大.在许多情况下,用户可能只需要那些覆盖许多短模式的长模式.此外,在很多应用中,只要得到产生的频繁序列模式的近似支持度就已足够,而不需要它们的精确支持度.介绍了能将误差控制在确定范围内的频繁序列模式精简基的概念,并开发了一个挖掘这种序列模式精简基的算法.实验结果显示计算频繁序列模式精简基是很有前途的.  相似文献   

16.
高效用序列模式挖掘是数据挖掘领域的一项重要内容, 在生物信息学、消费行为分析等方面具有重要的应用.与传统基于频繁项模式挖掘方法不同, 高效用序列模式挖掘不仅考虑项集的内外效用, 更突出项集的时间序列含义, 计算复杂度较高.尽管已经有一定数量的算法被提出应用于解决该类问题, 挖掘算法的时空效率依然成为该领域的主要研究热点问题.鉴于此, 本文提出一个基于模式增长的高效用序列模式挖掘算法HUSP-FP.依据高效用序列项集必须满足事务效用闭包属性要求, 算法首先在去除无用项后建立全局树, 进而采用模式增长方法从全局树上获取全部高效用序列模式, 避免产生候选项集. 在实验环节与目前效率较好的HUSP-Miner、USPAN、HUS-Span三类算法进行了时空计算对比, 实验结果表明本文给出算法在较小阈值下仍能有效挖掘到相关序列模式, 并且在计算时间和空间使用效率两方面取得了较大的提高.  相似文献   

17.
Point and click at web pages generate continuous data sequences, which flow into the web log data, causing the need to update previously mined web sequential patterns. Algorithms for mining web sequential patterns from scratch include WAP, PLWAP and Apriori-based GSP. Reusing old patterns with only recent additional data sequences in an incremental fashion, when updating patterns, would achieve fast response time with reasonable memory space usage. This paper proposes two algorithms, RePL4UP (Revised PLWAP For UPdate), and PL4UP (PLWAP For UPdate), which use the PLWAP tree structure to incrementally update web sequential patterns efficiently without scanning the whole database even when previous small items become frequent. The RePL4UP concisely stores the position codes of small items in the database sequences in its metadata during tree construction. During mining, RePL4UP scans only the new additional database sequences, revises the old PLWAP tree to restore information on previous small items that have become frequent, while it deletes previous frequent items that have become small using the small item position codes. PL4UP initially builds a bigger PLWAP tree that includes all sequences in the database using a tolerance support, t, that is lower than the regular minimum support, s. The position code features of the PLWAP tree are used to efficiently mine these trees to extract current frequent patterns when the database is updated. These approaches more quickly update old frequent patterns without the need to re-scan the entire updated database.  相似文献   

18.
Efficient data mining for path traversal patterns   总被引:16,自引:0,他引:16  
The authors explore a new data mining capability that involves mining path traversal patterns in a distributed information-providing environment where documents or objects are linked together to facilitate interactive access. The solution procedure consists of two steps. First, they derive an algorithm to convert the original sequence of log data into a set of maximal forward references. By doing so, one can filter out the effect of some backward references, which are mainly made for ease of traveling and concentrate on mining meaningful user access sequences. Second, they derive algorithms to determine the frequent traversal patterns-i.e., large reference sequences-from the maximal forward references obtained. Two algorithms are devised for determining large reference sequences; one is based on some hashing and pruning techniques, and the other is further improved with the option of determining large reference sequences in batch so as to reduce the number of database scans required. Performance of these two methods is comparatively analyzed. It is shown that the option of selective scan is very advantageous and can lead to prominent performance improvement. Sensitivity analysis on various parameters is conducted  相似文献   

19.
直接对生物序列进行频繁模式挖掘会产生很多冗余模式,闭合模式更能表达出序列的功能和结构。根据生物序列的特点,提出了基于相邻闭合频繁模式段的模式挖掘算法-JCPS。首先产生闭合相邻频繁模式段,然后对这些闭合频繁模式段进行组合,同时进行闭合检测,产生新的闭合频繁模式。通过对真实的蛋白质序列家族库的处理,证明该算法能有效处理生物序列数据。  相似文献   

20.
维护已发现的序列模式的方法主要有两种:一种是简单地利用已有的挖掘序列模式算法对更新后的整个数据库进行操作,这种方法涉及的数据库中的数据不仅有改变的部分而且有未改变的部分,而未改变的数据数量很大,当更新频率高时,代价是非常大的;另一种方法是根据库中记录数目改变的多少来决定何时对整个数据库进行操作,但是记录数目数据并不能代表序列模式化亦大,因此利用样品抽样的方法来评估序列模式改变的程度,并根据改变的程度决定何时对整个数据库进行操作来更新序列模式,从而较好地解决序列模式维护的问题,能高效地、准确地发现序列模式。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号