Locality sensitive hashing for sampling-based algorithms in association rule mining

Authors:	Chyouhwa Chen Shi-Jinn Horng Chin-Pin Huang

Affiliation:	1. Department of Civil Engineering, Universidad Nacional de Colombia at Medellín, Calle 65 No. 78-28, M1-110 Medellin, Colombia; 2. Director of the VREF Center of Excellence on Sustainable Urban Freight Systems (COE-SUFS), Department of Civil and Environmental Engineering, Rensselaer Polytechnic Institute, 110 8th St. JEC 4030, Troy, NY 12180, USA; 3. Department of Supply Chain Management, Iowa State University, 2167 Union Drive, Ames, IA 50011, USA; 4. Department of Technology Management and Economics, Chalmers University of Technology, Maskingränd 2, Göteborg 41258, Sweden; 1. Department of Chemical Engineering, University of California, Davis, Davis, CA 95616, USA; 2. Department of Materials Science and Engineering, University of California, Davis, Davis, CA 95616, USA; 1. Cheriton School of Computer Science, University of Waterloo, Canada; 2. Department of Electrical Engineering & Computer Science, University of Kansas, United States

Abstract:	Association rule mining is one of the most important techniques for intelligent system design and has been widely applied in a large number of real applications. However, classical mining algorithms cannot process very large databases in a reasonable amount of time. The sampling approach, which processes only a subset of the whole database, is a viable alternative. Because it examines only part of the data, such an approach cannot extract perfectly accurate rules. Previous works have tried to improve the accuracy by removing “outliers” from the initial sample based on global statistical properties of the sample. In this paper, we take the view that the initial sample may actually consist of multiple, possibly overlapping, subsets or clusters. It is therefore more reasonable to apply data clustering techniques to the initial sample before outlier removal is performed on the resulting clusters, so that outliers are removed based on local properties of individual clusters. However, clustering transactional data with very high dimensions is a difficult problem in itself. We solve this problem by interpreting locality sensitive hashing as a means for data clustering. Previously proposed algorithms may then optionally be used to remove the outliers in the individual clusters. We propose several concrete algorithms based on this general strategy. Using an extensive set of synthetic and real datasets, we evaluate our proposed algorithms and find that they exhibit better accuracy or execution time, or both, than previously proposed algorithms.
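
The idea of treating locality sensitive hashing as a clustering step can be illustrated with a minimal sketch. The abstract does not say which hash family or parameters the authors use, so everything below is an assumption made for illustration: MinHash with banding is swapped in as one common LSH scheme for set-valued data, transactions are taken to be non-empty sets of non-negative integer item IDs, and the names minhash_signature and lsh_clusters, along with the default num_hashes and bands values, are hypothetical. It is not the paper's algorithm.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # large prime modulus for the universal hash functions


def minhash_signature(transaction, hash_params):
    """Return a MinHash signature for a non-empty set of integer item IDs.

    Each (a, b) pair defines h(x) = (a * x + b) % PRIME; the signature keeps
    the minimum hash value over the items, one entry per hash function.
    """
    return tuple(
        min((a * item + b) % PRIME for item in transaction)
        for a, b in hash_params
    )


def lsh_clusters(transactions, num_hashes=8, bands=4, seed=0):
    """Group transaction indices into LSH buckets (assumed clustering step).

    The signature is split into `bands` equal slices (num_hashes should be
    divisible by bands). Transactions that agree on a whole slice land in the
    same bucket, so similar transactions tend to collide. A transaction is
    placed in one bucket per band, which means buckets may overlap.
    """
    rng = random.Random(seed)
    hash_params = [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hashes)
    ]
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, transaction in enumerate(transactions):
        sig = minhash_signature(transaction, hash_params)
        for band in range(bands):
            key = (band, sig[band * rows:(band + 1) * rows])
            buckets[key].append(idx)
    return dict(buckets)


if __name__ == "__main__":
    sample = [{1, 2, 3}, {1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
    for key, members in lsh_clusters(sample).items():
        print(key[0], members)
```

Each bucket plays the role of one (possibly overlapping) cluster of the initial sample. An outlier-removal routine of the kind the abstract mentions could then be run on each bucket separately, so that a transaction is judged against its local neighbourhood and not against the whole sample.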

This article is indexed in ScienceDirect and other databases.
