A new approach for generating efficient sample from market basket data
Authors: B. Chandra, Shalini Bhaskar
Affiliation: 1. Department of Mathematics, I.I.T. Roorkee, Roorkee, Haridwar, Uttarakhand 247 667, India; 2. Department of Mathematics, Institute of Basic Science, Khandari, Agra 282 002, India; 1. RCC Institute of Information Technology, Canal South Road, Beliaghata, Kolkata, West Bengal 700 015, India; 2. Modern Institute of Engineering and Technology, Bandel, Hooghly, West Bengal 712123, India; 3. Institute of Radio Physics and Electronics, University of Calcutta, 92 APC Road, Kolkata 700 009, India; 1. Analytical Chemistry Department, CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India; 2. Botany and Pharmacognosy Department, CSIR-Central Institute of Medicinal and Aromatic Plants, Lucknow, India; 3. Division of Parasitology, CSIR-Central Drug Research Institute, Lucknow, India; 1. Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh, India; 2. National Institute of Cholera and Enteric Diseases, Kolkata, West Bengal, India; 3. Yashoda Superspecialty Hospital, Ghaziabad, Uttar Pradesh, India; 4. Max Superspecialty Hospital, Vaishali, Ghaziabad, NCR, India
Abstract: Classical data mining algorithms require expensive passes over the entire database to find frequent itemsets and, from them, association rules. As databases grow, processing the full data becomes increasingly difficult. One solution is to draw a sample that acts as a representative of the entire database for finding association rules, chosen so that the distance between the sample and the complete database is minimal. Choosing a sample that represents the data well is not an easy task. Many algorithms have been proposed in the past; some are computationally fast, while others give better accuracy. In this paper, we present an algorithm for generating a sample that can replace the entire database for generating association rules and that aims to balance accuracy and speed. The proposed algorithm uses the average number of small, medium and large 1-itemsets in the database and the average weight of the transactions to define a threshold condition for the transactions. The set of transactions that satisfy the threshold condition is chosen as the representative of the entire database. The effectiveness of the proposed algorithm has been tested over several runs on databases generated by the IBM synthetic data generator. A comparative evaluation of the proposed technique against existing sampling techniques, in terms of accuracy and speed, has also been carried out.
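The abstract's notion of the "distance of the sample from the complete database" can be illustrated with a minimal sketch. The abstract does not define this distance, so the code below assumes one plausible choice: the mean absolute difference in relative support over all 1-itemsets. The function names (`item_supports`, `support_distance`, `random_sample`) are hypothetical, and the simple random sample is only a baseline for comparison, not the threshold-based algorithm proposed in the paper.

```python
import random
from collections import Counter


def item_supports(transactions):
    """Relative support of each single item (1-itemset) in a list of transactions."""
    n = len(transactions)
    if n == 0:
        return {}
    # set(t) ensures an item repeated within one transaction is counted once.
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c / n for item, c in counts.items()}


def support_distance(sample, database):
    """Assumed metric: mean absolute difference in 1-itemset support.

    This is an illustrative definition, not the one used in the paper.
    """
    s, d = item_supports(sample), item_supports(database)
    items = set(s) | set(d)
    if not items:
        return 0.0
    return sum(abs(s.get(i, 0.0) - d.get(i, 0.0)) for i in items) / len(items)


def random_sample(database, fraction, seed=0):
    """Baseline: simple random sample without replacement."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(database)))
    return rng.sample(database, k)


database = [["a", "b"], ["a"], ["b", "c"], ["a", "c"]]
sample = random_sample(database, 0.5)
print(support_distance(sample, database))
```

Under this assumed metric, a smaller value means the sample reproduces the item frequencies of the full database more closely, which is the property a representative sample for association rule mining needs.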
This article is indexed in ScienceDirect and other databases.