首页 | 官方网站   微博 | 高级检索  
     

Efficient Computation of k-Medians over Data Streams Under Memory Constraints
作者姓名:Zhi-Hong Chong  Jeffrey Xu Yu  Zhen-Jie Zhang  Xue-Min Lin  Wei Wang  and Ao-Ying Zhou
作者单位:[1]Department of Computer Science and Engineering, Fudan University, Shanghai 200433, P.R. China [2]Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin N. T Hong Kong Special Administration Region, P.R. China [3]School of Computing, National University of Singapore, 117574 Singapore [4]School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia [5]Department of Computer Science and Engineering, Southeast University, Nanjing 210096, P.R. China
基金项目:This work is supported by the National High Technology Development 863 Program of China under Grant No. 2002AA413310 and ARC Discovery Grant, Australia under Grant Nos. DP0346004 and DP0345710.
摘    要:In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1 - ε)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches.

关 键 词:数据流  数据采集  中间值  数据存储
收稿时间:2005-10-08
修稿时间:2005-10-082005-12-13

Efficient Computation of k-Medians over Data Streams Under Memory Constraints
Zhi-Hong Chong,Jeffrey Xu Yu,Zhen-Jie Zhang,Xue-Min Lin,Wei Wang,and Ao-Ying Zhou.Efficient Computation of k-Medians over Data Streams Under Memory Constraints[J].Journal of Computer Science and Technology,2006,21(2):284-296.
Authors:Zhi-Hong Chong  Jeffrey Xu Yu  Zhen-Jie Zhang  Xue-Min Lin  Wei Wang  Ao-Ying Zhou
Affiliation:(1) Department of Computer Science and Engineering, Fudan University, Shanghai, 200433, P.R. China;(2) Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin N.T. Hong Kong Special Administration Region, P.R. China;(3) School of Computing, National University of Singapore, 117574, Singapore;(4) School of Computer Science and Engineering, University of New South Wales, Sydney, 2052, Australia;(5) Department of Computer Science and Engineering, Southeast University, Nanjing, 210096, P.R. China
Abstract:In this paper, we study the problem of efficiently computing k-medians over high-dimensional and high speed data streams. The focus of this paper is on the issue of minimizing CPU time to handle high speed data streams on top of the requirements of high accuracy and small memory. Our work is motivated by the following observation: the existing algorithms have similar approximation behaviors in practice, even though they make noticeably different worst case theoretical guarantees. The underlying reason is that in order to achieve high approximation level with the smallest possible memory, they need rather complex techniques to maintain a sketch, along time dimension, by using some existing off-line clustering algorithms. Those clustering algorithms cannot guarantee the optimal clustering result over data segments in a data stream but accumulate errors over segments, which makes most algorithms behave the same in terms of approximation level, in practice. We propose a new grid-based approach which divides the entire data set into cells (not along time dimension). We can achieve high approximation level based on a novel concept called (1−∊)-dominant. We further extend the method to the data stream context, by leveraging a density-based heuristic and frequent item mining techniques over data streams. We only need to apply an existing clustering once to computing k-medians, on demand, which reduces CPU time significantly. We conducted extensive experimental studies, and show that our approaches outperform other well-known approaches. This work is supported by the National High Technology Development 863 Program of China under Grant No. 2002AA413310 and ARC Discovery Grant, Australia under Grant Nos. DP0346004 and DP0345710. Zhi-Hong Chong received his B.S. degree in computer science from Nanjing Meteorological Institute and M.S. degrees in economics from Institute of Fiscal Studies, Ministry of Finance, P.R. China, in 1991, 1999, respectively. He is currently a Ph.D. candidate in Fudan University, China. His research interests cover data mining, data streams. He has published several research papers in these areas in major international conferences and reputable journals. Jeffrey Xu Yu received his B.E., M.E. and Ph.D. degrees in computer science, from the University of Tsukuba, Japan, in 1985, 1987 and 1990, respectively. Jeffrey Xu Yu was a research fellow (Apr. 1990–Mar. 1991) and a faculty member (Apr. 1991–July 1992) in the Institute of Information Sciences and Electronics, University of Tsukuba. From July 1992 to June 2000, he was a lecturer in the Department of Computer Science, The Australian National University. Currently, he is an associate professor in the Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong. Jeffrey Xu Yu is a member of ACM, and a member of IEEE Computer Society. His interests cover XML database, data mining, data warehouse and on-line analytical processing, web-technology, query processing and query optimization, bioinformatics, wireless information systems, design and implementation of database management systems. Zhen-Jie Zhang received his B.S. degree from Department of Computer Science and Engineering, Fudan University in 2004. Currently he is a Ph.D. candidate in the School of Computing, National University of Singapore. His current research interests include skyline query, clustering and ranking database. Xue-Min Lin is an associate professor (reader) in the School of Computer Science and Engineering, the University of New South Wales. Currently, he is the head of database research group. Dr. Lin received his Ph.D. degree in computer science from the University of Queensland (Australia) in 1992 and his B.Sc. degree in applied math from Fudan University (China) in 1984. During 1984–1988, he studied for Ph.D. in applied math at Fudan University. Dr Lin’s principal research areas cover databases and graph visualisation. Wei Wang is currently a lecturer in the School of Computer Science and Engineering, The University of New South Wales, Australia. His current research interests include query processing and optimization for XML, integration of database and information retrieval technologies, data warehousing and OLAP, approximate query processing and data mining. He has published over twenty research papers in these areas in major international conferences. He received his Ph.D. degree in computer science from The Hong Kong University of Science and Technology, Hong Kong, China, in 2004, and B.Eng. degree in computer science and engineering from Shanghai Jiao Tong University, Shanghai, China, in 1999. Ao-Ying Zhou received his B.Sc. and M.Sc. degrees from Computer Department of Sichuan University in 1985 and 1988, Ph.D. degree in computer software from Fudan University in 1993. From 1996 to 1999 he was appointed the vice director of Computer Science Department of Fudan University and the director from 1999 to 2002. Prof. Zhou is engaged in research on data mining and business intelligence, web data management and peer-to-peer computing.
Keywords:data streams  k-medians  cluster  data mining
本文献已被 维普 万方数据 SpringerLink 等数据库收录!
点击此处可从《计算机科学技术学报》浏览原始摘要信息
点击此处可从《计算机科学技术学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号