Optimal subspace dimensionality for k-nearest-neighbor queries on clustered and dimensionality reduced datasets with SVD期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Optimal subspace dimensionality for k-nearest-neighbor queries on clustered and dimensionality reduced datasets with SVD

Authors:

Alexander Thomasian Yue Li Lijuan Zhang

Affiliation:

(1) Thomasian and Associates, 17 Meadowbrook Rd, Pleasantville, NY 10570, USA;(2) AIG Software, Jersey City, NJ, USA;(3) Amicas Inc., 20 Guest St., Boston, MA 02135, USA

Abstract:

Content based image retrieval represents images as N- dimensional feature vectors. The k images most similar to a target image, i.e., those closest to its feature vector, are determined by applying a k-nearest-neighbor (k-NN) query. A sequential scan of the feature vectors for k-NN queries is costly for a large number of images when N is high. The search space can be reduced by indexing the data, but the effectiveness of multidimensional indices is poor for high dimensional data. Building indices on dimensionality reduced data is one method to improve indexing efficiency. We utilize the Singular Value Decomposition (SVD) method to attain dimensionality reduction (DR) with minimum information loss for static data. Clustered SVD (CSVD) combines clustering with SVD to attain a lower normalized mean square error (NMSE) by taking advantage of the fact that most real-world datasets exhibit local rather than global correlations. The Local Dimensionality Reduction (LDR) method differs from CSVD in that it uses an SVD-friendly clustering method, rather than the k-means clustering method. We propose a hybrid method which combines the clustering method of LDR with the DR method of CSVD, so that the vector of the number of retained dimensions of the clusters is determined by varying the NMSE. We build SR-tree indices based on the vectors in the clusters to determine the number of accessed pages for exact k-NN queries (Thomasian et al., Inf Process Lett - IPL 94(6):247–252, 2005) (see Section A.3 versus the NMSE. Varying the NMSE a minimum cost can be found, because the lower cost of accessing a smaller index is offset with the higher postprocessing cost resulting from lower retrieval accuracy. Experimenting with one synthetic and three real-world datasets leads to the conclusion that the lowest cost is attained at NMSE ≈ 0.03 and between 1/3 and 1/2 of the number of dimensions are retained. In one case doubling the number of dimensions cuts the number of accessed pages by one half. The Appendix provides the requisite background information for reading this paper.

Lijuan ZhangEmail:

Alexander Thomasian is a Professor of Computer Science at NJIT. He was a faculty member at Case Western University and University of Southern California and an adjunct professor at Columbia University, while a Research Staff Member at the IBM T. J. Watson Research Center (1985–1998). He received his Masters and PhD degrees in Computer Science from UCLA. Dr. Thomasian’s research has more recently been focused on indexing high-dimensional datasets and the performance of storage systems. He has contributed to the performance analysis area and especially the analysis and synthesis of concurrency control methods. He has published over 50 journal and over 60 conference papers. He holds four patents, received innovation and invention awards at IBM. He has served as an area editor of the IEEE Trans. Parallel and Distributed Systems and has been on the program committees of numerous conferences. He has given numerous tutorials on storage systems, high performance systems for database applications, etc. He is the author of Database Concurrency Control: Methods, Performance, and Analysis, Kluwer 1996. Dr. Thomasian is a member of ACM and a Fellow of IEEE. MediaObjects/11042_2008_206_Figa_HTML.gif

MediaObjects/11042_2008_206_Figa_HTML.gif

Yue Li started her PhD studies at NJIT in Fall 2000, after completing her MS degree in Computer Science at Shandong University, Jinan, China. Her PhD thesis on “Efficient similarity search in high dimensional data” was completed in May 2004. Dr Li is a Software Engineer at AIG Software in NJ. She is the author of a half a dozen publications. Lijuan Zhang received her Master’s degree in Computer Science from Northeastern University, China in 1999 and PhD degree in Computer Science from NJIT in 2005 with a dissertation in highdimensional indexing methods. She was a software engineer in Huawei Technologies, China. In 2005, She joined Amicas, Inc as a software engineer focusing on picture archiving and communication technologies. Her research interest was in high-dimensional indexing techniques, similarity search, content-based image retrieval, time series etc. Her current interest is in medical imaging and information management, including DICOM, HL7. MediaObjects/11042_2008_206_Fige_HTML.gif

MediaObjects/11042_2008_206_Fige_HTML.gif

Keywords:

Multimedia databases Content-based retrieval Dimensionality reduction Singular value decomposition Clustering High dimensional data Nearest neighbor search

本文献已被 SpringerLink 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏