Similar Documents
20 similar documents found.
1.
In multimedia information retrieval, multimedia data are represented as vectors in high-dimensional space. To search these vectors efficiently, a variety of indexing methods have been proposed. However, the performance of these indexing methods degrades dramatically with increasing dimensionality, a phenomenon known as the dimensionality curse. To mitigate it, dimensionality reduction methods have been proposed; they map feature vectors in high-dimensional space into vectors in low-dimensional space before the data are indexed. This paper proposes a novel method for dimensionality reduction based on a function that approximates the Euclidean distance using the norm and angle components of a vector. First, we identify the causes of, and discuss basic solutions to, errors in angle approximation during the approximation of the Euclidean distance. Then, we propose a new method for dimensionality reduction that extracts a set of subvectors from a feature vector and maintains only the norm and the approximated angle for every subvector. The selection of a good reference vector is crucial for accurate approximation of the angle component. We present criteria for a good reference vector and propose a method for choosing one. We also define a novel distance function using the norm and angle components, and formally prove that it consistently lower-bounds the Euclidean distance, which implies that information retrieval with this function does not incur any false dismissals. Finally, the superiority of the proposed approach is verified through extensive experiments with synthetic and real-life data sets.
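A minimal sketch of the norm-and-angle idea described above (illustrative, not the paper's exact construction): each vector is reduced to its norm and its angle to a fixed reference vector. Because the angle between two vectors is at least the difference of their angles to the reference, the law of cosines evaluated at that difference lower-bounds the true Euclidean distance, so no false dismissals occur.

```python
import numpy as np

def norm_angle_key(x, ref):
    """Reduce a vector to (norm, angle-to-reference)."""
    nx = np.linalg.norm(x)
    cos = np.dot(x, ref) / (nx * np.linalg.norm(ref))
    return nx, np.arccos(np.clip(cos, -1.0, 1.0))

def lower_bound_dist(key_x, key_q):
    """Lower-bounds ||x - q|| using only the two keys.
    angle(x, q) >= |angle(x, ref) - angle(q, ref)|, and cos is decreasing
    on [0, pi], so the law of cosines with the angle difference can only
    shrink the distance."""
    nx, ax = key_x
    nq, aq = key_q
    return np.sqrt(max(nx**2 + nq**2 - 2*nx*nq*np.cos(abs(ax - aq)), 0.0))

rng = np.random.default_rng(0)
ref = rng.normal(size=64)
x, q = rng.normal(size=64), rng.normal(size=64)
lb = lower_bound_dist(norm_angle_key(x, ref), norm_angle_key(q, ref))
assert lb <= np.linalg.norm(x - q) + 1e-9   # no false dismissals
```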

2.
Due to the well-known dimensionality curse, search in a high-dimensional space is considered a "hard" problem. In this paper, a novel composite distance transformation method, called CDT, is proposed to support fast k-nearest-neighbor (k-NN) search in high-dimensional spaces. In CDT, all n data points are first grouped into clusters by a k-means clustering algorithm. A composite distance key is then computed for each data point, and these n index keys are inserted into a partition-based B+-tree. Given a query point, its k-NN search in high-dimensional space is thereby transformed into a search in a single-dimensional space with the aid of the CDT index. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show that this method outperforms state-of-the-art high-dimensional search techniques such as the X-tree, VA-file, iDistance, and NB-Tree.
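A hedged sketch of the general pattern the abstract describes: cluster with k-means, map each point to a one-dimensional composite key, and answer queries with range probes on the sorted keys. The key layout below (cluster label times a stretch constant, plus distance to the centroid) and the fixed search radius are assumptions in the spirit of such schemes, not CDT's actual formula; a sorted array stands in for the partition-based B+-tree.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 32))
k = 16
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
radii = np.array([np.linalg.norm(X[km.labels_ == c] - km.cluster_centers_[c], axis=1).max()
                  for c in range(k)])
stretch = 2 * radii.max() + 1.0          # keeps clusters in disjoint key ranges

def key(label, dist):
    return label * stretch + dist        # hypothetical composite key

keys = key(km.labels_, np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))
order = np.argsort(keys)
keys_sorted = keys[order]                # sorted array standing in for the B+-tree

def knn(q, kq=5, r=3.0):
    """Collect points whose distance-to-centroid lies within r of the
    query's (a valid prune by the triangle inequality), then refine.
    A full implementation grows r until the k-th neighbor is guaranteed."""
    cand = []
    for c in range(k):
        d = np.linalg.norm(q - km.cluster_centers_[c])
        lo = np.searchsorted(keys_sorted, key(c, d - r))
        hi = np.searchsorted(keys_sorted, key(c, d + r))
        cand.extend(order[lo:hi])
    cand = np.array(cand, dtype=int)
    return cand[np.argsort(np.linalg.norm(X[cand] - q, axis=1))[:kq]]

print(knn(rng.normal(size=32)))
```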

3.
In multimedia retrieval, a query is typically refined interactively toward the "optimal" answers by exploiting user feedback. However, in existing work, the refined query is re-evaluated from scratch in each iteration. This is not only inefficient but also fails to exploit the answers that may be common between iterations. Furthermore, it may take too many iterations to reach the "optimal" answers. In this paper, we introduce a new approach called OptRFS (optimizing relevance feedback search by query prediction) for iterative relevance feedback search. OptRFS aims to take users to the "optimal" results as fast as possible. It optimizes relevance feedback search both by shortening the search time in each iteration and by reducing the number of iterations. OptRFS predicts the potential candidates for the next iteration and maintains this small set for efficient sequential scan. By doing so, repeated candidate accesses (i.e., random accesses) can be saved, reducing the search time for the next iteration. In addition, an efficient scan of the overlap before the next search starts tightens the search space with a smaller pruning radius. As a step forward, OptRFS also predicts the "optimal" query, which corresponds to the "optimal" answers, based on the queries of earlier iterations. By doing so, some intermediate iterations can be skipped, reducing the total number of iterations. Taking the correlations among earlier iterations into consideration, OptRFS investigates linear regression, exponential smoothing, and linear exponential smoothing to predict the next refined query and thus decide the overlap of candidates between two consecutive iterations. Considering the special features of relevance feedback, OptRFS further introduces adaptive linear exponential smoothing to self-adjust the parameters for more accurate prediction. We implemented OptRFS, and our experimental study on real-life data sets shows that it can reduce the total cost of relevance feedback search significantly. Some interesting features of relevance feedback search are also discovered and discussed.
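The query-prediction step can be pictured with standard smoothing forecasters applied per coordinate to the sequence of refined query vectors. The smoothing constants below are illustrative placeholders, not values from the paper:

```python
import numpy as np

def simple_exp_smoothing(queries, alpha=0.5):
    """Predict the next refined query vector from past ones."""
    s = queries[0]
    for q in queries[1:]:
        s = alpha * q + (1 - alpha) * s
    return s

def linear_exp_smoothing(queries, alpha=0.5, beta=0.3):
    """Holt's linear (trend-aware) smoothing, per coordinate."""
    s, t = queries[0], queries[1] - queries[0]
    for q in queries[1:]:
        s_prev = s
        s = alpha * q + (1 - alpha) * (s + t)
        t = beta * (s - s_prev) + (1 - beta) * t
    return s + t   # one-step-ahead forecast

# Feedback iterations drifting toward an "optimal" query:
qs = [np.array([1.0, 1.0]) + 0.4 * i for i in range(4)]
print("simple :", simple_exp_smoothing(qs))
print("trended:", linear_exp_smoothing(qs))   # anticipates the next refinement
```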

4.
Content-based image retrieval represents images as N-dimensional feature vectors. The k images most similar to a target image, i.e., those closest to its feature vector, are determined by applying a k-nearest-neighbor (k-NN) query. A sequential scan of the feature vectors for k-NN queries is costly for a large number of images when N is high. The search space can be reduced by indexing the data, but the effectiveness of multidimensional indices is poor for high-dimensional data. Building indices on dimensionality-reduced data is one method to improve indexing efficiency. We utilize the Singular Value Decomposition (SVD) method to attain dimensionality reduction (DR) with minimum information loss for static data. Clustered SVD (CSVD) combines clustering with SVD to attain a lower normalized mean square error (NMSE) by taking advantage of the fact that most real-world datasets exhibit local rather than global correlations. The Local Dimensionality Reduction (LDR) method differs from CSVD in that it uses an SVD-friendly clustering method rather than k-means clustering. We propose a hybrid method that combines the clustering method of LDR with the DR method of CSVD, so that the vector of the number of retained dimensions per cluster is determined by varying the NMSE. We build SR-tree indices on the vectors in the clusters to determine the number of accessed pages for exact k-NN queries (Thomasian et al., Inf. Process. Lett. 94(6):247–252, 2005; see Section A.3) versus the NMSE. By varying the NMSE, a minimum cost can be found, because the lower cost of accessing a smaller index is offset by the higher postprocessing cost resulting from lower retrieval accuracy. Experiments with one synthetic and three real-world datasets lead to the conclusion that the lowest cost is attained at NMSE ≈ 0.03, with between 1/3 and 1/2 of the dimensions retained. In one case, doubling the number of dimensions cuts the number of accessed pages in half. The Appendix provides the requisite background for reading this paper.
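A sketch of the clustering-plus-SVD recipe as the abstract outlines it: cluster the data, run SVD per cluster, and retain in each cluster the fewest principal components that keep the discarded energy within the NMSE budget. The retention rule below is one reasonable reading, not the paper's exact procedure:

```python
import numpy as np
from sklearn.cluster import KMeans

def csvd_reduce(X, n_clusters=4, target_nmse=0.03):
    """Per-cluster SVD, keeping the fewest dimensions whose discarded
    variance stays within the NMSE budget (assumed retention rule)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    out = {}
    for c in range(n_clusters):
        Xc = X[km.labels_ == c]
        mu = Xc.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
        energy = np.cumsum(s**2) / np.sum(s**2)
        m = int(np.searchsorted(energy, 1.0 - target_nmse) + 1)
        out[c] = ((Xc - mu) @ Vt[:m].T, Vt[:m], mu)   # coords, basis, mean
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 64)) @ np.diag(np.linspace(2, 0.1, 64))
reduced = csvd_reduce(X)
print({c: coords.shape[1] for c, (coords, _, _) in reduced.items()})
```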

Alexander Thomasian is a Professor of Computer Science at NJIT. He was a faculty member at Case Western Reserve University and the University of Southern California, and an adjunct professor at Columbia University while a Research Staff Member at the IBM T. J. Watson Research Center (1985–1998). He received his Master's and PhD degrees in Computer Science from UCLA. Dr. Thomasian's research has more recently focused on indexing high-dimensional datasets and the performance of storage systems. He has contributed to the performance analysis area, especially the analysis and synthesis of concurrency control methods. He has published over 50 journal and over 60 conference papers, holds four patents, and received innovation and invention awards at IBM. He has served as an area editor of IEEE Transactions on Parallel and Distributed Systems and has been on the program committees of numerous conferences. He has given numerous tutorials on storage systems, high-performance systems for database applications, etc. He is the author of Database Concurrency Control: Methods, Performance, and Analysis (Kluwer, 1996). Dr. Thomasian is a member of the ACM and a Fellow of the IEEE. Yue Li started her PhD studies at NJIT in Fall 2000, after completing her MS degree in Computer Science at Shandong University, Jinan, China. Her PhD thesis, "Efficient similarity search in high dimensional data", was completed in May 2004. Dr. Li is a Software Engineer at AIG Software in NJ and the author of half a dozen publications. Lijuan Zhang received her Master's degree in Computer Science from Northeastern University, China in 1999 and her PhD in Computer Science from NJIT in 2005, with a dissertation on high-dimensional indexing methods. She was a software engineer at Huawei Technologies, China. In 2005, she joined Amicas, Inc. as a software engineer focusing on picture archiving and communication technologies. Her research interests were in high-dimensional indexing techniques, similarity search, content-based image retrieval, time series, etc. Her current interest is in medical imaging and information management, including DICOM and HL7.

5.
Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings, and non-standard transliteration, poses considerable problems for text mining. Such corruptions are very common in instant-messenger and short-message-service data, and they adversely affect off-the-shelf text mining methods. Most techniques address this problem with supervised methods that make use of hand-labeled corrections, but human-generated labels and corrections are very expensive and time-consuming to obtain because of multilinguality and the complexity of the corruptions. While we do not champion unsupervised methods over supervised ones when quality of results is the sole concern, we demonstrate that unsupervised methods can provide cost-effective results without the expensive human intervention needed to generate a parallel labeled corpus. A generative-model-based unsupervised technique is presented that maps non-standard words to their corresponding conventional frequent forms. A hidden Markov model (HMM) over a "subsequencized" representation of words is used, where a word is represented as a bag of weighted subsequences. The approximate maximum-likelihood inference algorithm is such that the training phase involves clustering over vectors rather than the customary and expensive dynamic programming (Baum–Welch algorithm) over sequences that HMMs normally require. A principled transformation of the maximum-likelihood "central clustering" cost function of Baum–Welch into a "pairwise similarity" based clustering is proposed. This transformation makes it possible to apply "subsequence kernel" based methods, which model delete and insert corruptions well. The novelty of this approach lies in avoiding the expensive Baum–Welch iterations required for HMMs, through an approximation of the log-likelihood function and by establishing a connection between the log-likelihood and a pairwise distance. Anecdotal evidence of efficacy is provided on public and proprietary data.
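The bag-of-subsequences representation can be pictured with character subsequences: noisy variants of a word share many short subsequences even under insertions and deletions, so a set-overlap similarity groups them with their conventional form. This toy uses Jaccard overlap on length-2 subsequences as a stand-in for the paper's weighted subsequence kernel:

```python
from itertools import combinations

def subseqs(word, length=2):
    """Bag of character subsequences (order-preserving, gaps allowed)."""
    return set(combinations(word, length))

def sim(a, b):
    """Jaccard overlap between subsequence sets (toy kernel stand-in)."""
    sa, sb = subseqs(a), subseqs(b)
    return len(sa & sb) / len(sa | sb)

noisy = ["tomorrow", "tmrw", "2morrow", "tomorow", "great", "gr8", "grt"]
canon = ["tomorrow", "great"]          # assumed canonical frequent forms
for w in noisy:
    best = max(canon, key=lambda c: sim(w, c))
    print(f"{w!r:12} -> {best!r}  (sim={sim(w, best):.2f})")
```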

6.
Exploiting a judicious partitioning of the high-dimensional data space, a simple and effective KNN search algorithm, LBD, is proposed. The data are partitioned into multiple subsets by clustering, and each high-dimensional vector within a cluster subset is given a simplified representation defined by a distance and a bit code. During a KNN search, distance information is first used to determine the candidate range, and the fact that bit codes differ on certain dimensions is then used to further narrow the search range and improve pruning efficiency. When bit-code strings are compared, dimensions are examined in order of their contribution, which greatly accelerates the filtering of non-candidate points. LBD uses a special B+-tree organization to reduce I/O and distance-computation costs. Experiments with both synthetic and real data verify that LBD achieves higher search efficiency.
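One plausible reading of the distance-plus-bit-code filter (not necessarily the paper's exact encoding): one bit per dimension records whether a coordinate lies above the cluster centroid, a distance ring prunes by the triangle inequality, and bit-code disagreement on dimensions where the query is far from the centroid plane safely eliminates the rest before any exact distance is computed.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 16))
centroid = X.mean(axis=0)     # one "cluster"; a real LBD clusters first

codes = X > centroid                          # one bit per dimension
dists = np.linalg.norm(X - centroid, axis=1)

def candidates(q, radius):
    """Two-stage filter: distance ring first, then bit-code disagreement."""
    qd, qc = np.linalg.norm(q - centroid), q > centroid
    ring = np.abs(dists - qd) <= radius       # triangle-inequality prune
    # A bit can only disagree "for real" if the query sits more than
    # `radius` away from the centroid plane in that dimension.
    strong = np.abs(q - centroid) > radius
    ok = ~np.any((codes != qc) & strong, axis=1)
    return np.flatnonzero(ring & ok)

q = rng.normal(size=16)
cand = candidates(q, radius=2.0)
print(len(cand), "of", len(X), "points survive filtering")
```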

7.
The k-nearest-neighbor (kNN) join operation associates each data object in one data set with its k nearest neighbors from the same or a different data set. The kNN join on high-dimensional data is a very expensive operation. Existing high-dimensional kNN join algorithms were designed for static data sets and therefore cannot handle updates efficiently. In this article, we propose a novel kNN join method, named kNNJoin+, which supports efficient incremental computation of kNN join results under updates on high-dimensional data. As a by-product, our method also answers reverse kNN queries with very little overhead. We have performed an extensive experimental study, and the results show the effectiveness of kNNJoin+ for processing high-dimensional kNN joins in dynamic workloads.

8.
Similarity search is important in information-retrieval applications where objects are usually represented as vectors of high dimensionality. This paper proposes a new dimensionality-reduction technique and an indexing mechanism for high-dimensional datasets. The proposed technique reduces, for each data vector, the dimensions whose coordinates are less than a critical value. This flexible, datawise dimensionality reduction helps improve indexing mechanisms for high-dimensional datasets whose distributions are skewed in all coordinates. To apply the proposed technique to information retrieval, a CVA-file (compact VA-file), a revised version of the VA-file, is developed. Using a CVA-file, the size of index files is reduced further while the tightness of the index bounds is maximally preserved. The effectiveness is confirmed with synthetic and real data.
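A sketch of the datawise idea under an assumed sparse layout (not the CVA-file's actual format): per vector, only coordinates reaching a critical magnitude are stored exactly, plus the energy of the dropped remainder, which still allows a distance lower bound via the reverse triangle inequality.

```python
import numpy as np

def compact_encode(x, critical=0.5):
    """Keep coordinates at or above the critical magnitude, plus the
    squared norm of the dropped remainder for distance bounding."""
    keep = np.abs(x) >= critical
    return np.flatnonzero(keep), x[keep], float(np.sum(x[~keep] ** 2)), len(x)

def dist_lower_bound(q, enc):
    idx, vals, rest_sq, n = enc
    mask = np.ones(n, dtype=bool)
    mask[idx] = False
    d2_kept = np.sum((q[idx] - vals) ** 2)        # exact on kept dims
    # Reverse triangle inequality on the dropped coordinates.
    d_rest = abs(np.linalg.norm(q[mask]) - np.sqrt(rest_sq))
    return np.sqrt(d2_kept + d_rest ** 2)

rng = np.random.default_rng(4)
x, q = rng.normal(size=32), rng.normal(size=32)
enc = compact_encode(x)
assert dist_lower_bound(q, enc) <= np.linalg.norm(x - q) + 1e-9
print(f"kept {len(enc[1])}/32 coordinates")
```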

9.
Finding test data to cover structural test coverage criteria such as branch coverage is largely a manual and hence expensive activity. A potentially low-cost alternative is to generate the required test data automatically. Search-based test data generation is one approach that has attracted recent interest. It is based on the definition of an evaluation or cost function that can discriminate between candidate test cases with respect to achieving a given test goal. The cost function is implemented by appropriate instrumentation of the program under test. The candidate test is then executed on the instrumented program, yielding an evaluation of the candidate in terms of the "distance" between the computation it achieves and the computation required to achieve the test goal. Provided the cost function can reliably discriminate between candidate tests that are close to or far from covering the test goal, and the goal is feasible, a search process can converge to a solution, i.e., a test case that satisfies the coverage goal. For some programs, however, an informative cost function is difficult to define: the operations they perform cause the cost function to return a constant value for a very wide range of inputs. A typical example of this problem arises in the instrumentation of branch predicates that depend on the value of a Boolean-valued (flag) variable, although the problem is not limited to programs that contain flag variables. Although methods are known for overcoming the problems of flag variables in particular cases, the more general problem of a near-constant cost function has not been tackled. This paper presents a new heuristic for directing the search when the cost function at a test goal cannot differentiate between candidate test inputs. The heuristic directs the search toward test cases that produce rare or scarce data states, since scarce inputs to the cost function are more likely to produce new cost values. The proposed method is evaluated empirically on a number of example programs for which existing methods are inadequate.
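A toy illustration of the flag problem and the scarcity idea; the fitness shape below captures the heuristic's spirit, not the paper's exact formulation. The branch distance is flat almost everywhere, so the search breaks ties by favoring inputs whose instrumented data state has been seen least often.

```python
import random
from collections import Counter

def program(x):
    """Toy flag program: branch distance to `flag == True` is constant
    almost everywhere, giving the search no gradient."""
    flag = (x % 1009 == 42)          # rare internal data state
    return flag                       # test goal: cover the flag-true branch

seen = Counter()

def fitness(x):
    state = x % 1009                  # instrumented intermediate data state
    seen[state] += 1
    if program(x):
        return 0.0                    # goal covered
    # Flat branch distance everywhere else; break ties by rewarding
    # inputs whose data state has been observed least often (scarcity).
    return 1.0 + seen[state] / (1 + sum(seen.values()))

random.seed(0)
best = min((random.randrange(10**6) for _ in range(20000)), key=fitness)
print(best, program(best))
```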

10.
In high-dimensional KNN query processing, approximate vectors and one-dimensional transformations can effectively overcome the dimensionality curse. Combining these two ideas, an index structure based on zone-bit codes and distances (BD) is proposed to support fast KNN queries. Based on the distribution of vectors in high-dimensional space, the space is partitioned so that the many points lying near its surface are, as far as possible, assigned to different partitions, which raises pruning efficiency during retrieval. The notion of a zone-bit code and a transformation function are introduced to approximate each high-dimensional vector and convert it into a one-dimensional value, and these values are organized into a B+-tree index. A fast KNN query algorithm applies two levels of filtering to shrink the search range and lower the cost of tree search. Extensive experiments with both synthetic and real data verify that BD achieves higher retrieval efficiency than other comparable indexes.

11.
We present a randomized procedure named Hierarchical Sampling from Sketches (Hss) that can be used for estimating a class of functions over the frequency vector f of update streams. We illustrate this by applying the Hss technique to design nearly space-optimal algorithms for estimating the pth moment of the frequency vector, for real p ≥ 2, and for estimating the entropy of a data stream. Preliminary versions of this paper appeared as the following conference publications: "Simpler algorithm for estimating frequency moments of data streams," Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh and Chandan Saha, Proceedings of the ACM Symposium on Discrete Algorithms, 2006, pp. 708–713, and "Estimating entropy over data streams," Lakshminath Bhuvanagiri and Sumit Ganguly, Proceedings of the European Symposium on Algorithms, LNCS, vol. 4168, pp. 148–159, Springer, 2006.

12.
The notorious "dimensionality curse" is a well-known phenomenon for any multi-dimensional index attempting to scale up to high dimensions. One well-known approach to overcoming the performance degradation with increasing dimensionality is to reduce the dimensionality of the original dataset before constructing the index. However, identifying the correlation among the dimensions and effectively reducing them are challenging tasks. In this paper, we present an adaptive Multi-level Mahalanobis-based Dimensionality Reduction (MMDR) technique for high-dimensional indexing. Our MMDR technique has four notable features compared with existing methods. First, it discovers elliptical clusters for more effective dimensionality reduction by using only the low-dimensional subspaces. Second, data points in the different axis systems are indexed using a single B+-tree. Third, our technique is highly scalable in terms of data size and dimensionality. Finally, it is dynamic and adaptive to insertions. An extensive performance study was conducted using both real and synthetic datasets, and the results show that our technique not only achieves higher precision but also enables queries to be processed efficiently.
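The elliptical-cluster idea rests on the Mahalanobis distance, which whitens each cluster by its own covariance so that distance is measured relative to the cluster's shape; a minimal illustration (not MMDR itself):

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

rng = np.random.default_rng(5)
# An elongated (elliptical) cluster: stretched 10x along the first axis.
A = np.array([[3.0, 0.0], [0.0, 0.3]])
pts = rng.normal(size=(500, 2)) @ A
mean = pts.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pts, rowvar=False))

p_along = mean + np.array([3.0, 0.0])    # far in the stretched direction
p_across = mean + np.array([0.0, 3.0])   # same Euclidean distance, across
# Equal Euclidean distances, very different Mahalanobis distances:
print(mahalanobis(p_along, mean, cov_inv), mahalanobis(p_across, mean, cov_inv))
```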

13.
Recently, databases have been used to store multimedia data such as images, maps, video clips, and music clips. To search them, such data are represented by various features composed of high-dimensional vectors. As a result, the dimensionality of the data increases considerably, causing 'the curse of dimensionality': index structures perform poorly as data dimensionality grows. To overcome this problem, research on dimensionality reduction has been conducted. However, some reduction methods do not guarantee no false dismissals, while others incur high computational cost. This paper proposes dimensionality-reduction techniques that guarantee no false dismissals while providing considerable efficiency by approximating distances with a few values. To provide the no-false-dismissal property, approximated distances must always be smaller than the original distances. The Cauchy–Schwarz inequality and two trigonometric equations are used, and a dimension-partitioning technique is applied, to approximate distances in a way that reduces the difference between the approximated and original distances. As a result, the proposed techniques reduce the candidate set of a query result for efficient query processing.
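The Cauchy–Schwarz route to a no-false-dismissal bound can be sketched directly: within each partition, the reverse triangle inequality (a consequence of Cauchy–Schwarz) gives ||x_i - q_i|| >= | ||x_i|| - ||q_i|| |, so summing squared norm differences over partitions lower-bounds the squared Euclidean distance, and finer partitions tighten the bound. Contiguous dimension blocks are an assumption here:

```python
import numpy as np

def partition_norms(x, n_parts):
    """Approximate a vector by the norms of its dimension partitions."""
    return np.array([np.linalg.norm(p) for p in np.array_split(x, n_parts)])

def lb_distance(nx, nq):
    # sum_i (||x_i|| - ||q_i||)^2 <= sum_i ||x_i - q_i||^2 = ||x - q||^2
    return np.linalg.norm(nx - nq)

rng = np.random.default_rng(6)
x, q = rng.normal(size=64), rng.normal(size=64)
for parts in (1, 4, 16, 64):
    nx, nq = partition_norms(x, parts), partition_norms(q, parts)
    print(f"{parts:2d} partitions: lower bound {lb_distance(nx, nq):.3f}")
print(f"true distance: {np.linalg.norm(x - q):.3f}")
```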

14.
A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators that integrate the selection of top tuples with join processing. These operators access just “enough” of the input in order to generate just “enough” output and can offer significant speed-ups for query evaluation. The number of input tuples that an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the important problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized physical plans. We introduce an estimation methodology, termed deep, for approximating the input depths of rank join operators in a physical execution plan. At the core of deep lies a general, principled framework that formalizes depth computation in terms of the joint distribution of scores in the base tables. This framework results in a systematic estimation methodology that takes the characteristics of the data directly into account and thus enables more accurate estimates. We develop novel estimation algorithms that provide an efficient realization of the formal deep framework, and describe their integration on top of the statistics module of an existing query optimizer. We validate the performance of deep with an extensive experimental study on data sets of varying characteristics. The results verify the effectiveness of deep as an estimation method and demonstrate its advantages over previously proposed techniques.

15.
This paper proposes a framework for detecting global state predicates in systems of processes with approximately-synchronized real-time clocks. Timestamps from these clocks are used to define two orderings on events: "definitely occurred before" and "possibly occurred before". These orderings lead naturally to definitions of three distinct detection modalities, i.e., three meanings of "predicate held during a computation": "possibly held", "definitely held", and "definitely held in a specific global state". This paper defines these modalities and gives efficient algorithms for detecting them. The algorithms are based on algorithms of Garg and Waldecker, Alagar and Venkatesan, Cooper and Marzullo, and Fromentin and Raynal. Complexity analysis shows that, under reasonable assumptions, these real-time-clock-based detection algorithms are less expensive than detection algorithms based on Lamport's happened-before ordering. Sample applications are given to illustrate the benefits of this approach.
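With clocks synchronized to within some bound ε, the two orderings fall out of interval timestamps directly; a minimal sketch, with ε and the scalar event timestamps assumed for illustration:

```python
EPS = 0.005  # assumed clock synchronization bound (seconds)

def definitely_before(e1, e2):
    """e1's latest possible real time precedes e2's earliest possible."""
    return e1 + EPS < e2 - EPS

def possibly_before(e1, e2):
    """e1's earliest possible real time precedes e2's latest possible."""
    return e1 - EPS < e2 + EPS

a, b = 1.000, 1.008   # local timestamps from two processes
print(definitely_before(a, b))  # False: 1.005 < 1.003 fails
print(possibly_before(a, b))    # True
print(possibly_before(b, a))    # True: concurrent under clock uncertainty
```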

16.
A Similarity Search Algorithm Based on Generalized Hypersurface Trees
Zhang Zhaogong, Li Jianzhong. Journal of Software (软件学报), 2002, 13(10): 1969–1976
Similarity search is one of the main areas of data mining: it retrieves similar data from a database and discovers the similarity among data, with applications to image databases, spatial databases, and time-series analysis. In Euclidean space (a special metric space), R-tree-based similarity search algorithms are efficient at low dimensionality, but as the dimensionality increases they degenerate into linear scans. This phenomenon is called the dimensionality curse, and its main cause is overlap among the data. When the data volume is large and the dimensionality is high, distance computations and I/O operations become very expensive. This paper proposes a new space-partitioning method and index structure for metric spaces, the rgh-tree, which uses the distances from database objects to a small number of fixed reference objects to partition and distribute the data, producing a balanced tree with no data overlap among its nodes. A similarity search algorithm based on the rgh-tree is also proposed; it incurs low I/O cost and few distance computations, with an average complexity of approximately O(n^0.58), and resolves several problems of existing algorithms.
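The rgh-tree's core device, distances to a few fixed reference objects, supports pruning in any metric space via the triangle inequality; a generic pivot-table sketch (not the rgh-tree's actual node layout):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 20))
pivots = X[:3]                           # a few fixed reference objects
# Precomputed n x 3 table of distances from every point to every pivot.
P = np.linalg.norm(X[:, None] - pivots[None], axis=2)

def range_query(q, r):
    qp = np.linalg.norm(pivots - q, axis=1)
    # |d(x, p) - d(q, p)| <= d(x, q): prune when the bound exceeds r.
    lb = np.max(np.abs(P - qp), axis=1)
    cand = np.flatnonzero(lb <= r)
    hits = cand[np.linalg.norm(X[cand] - q, axis=1) <= r]
    return hits, len(cand)

hits, checked = range_query(rng.normal(size=20), r=4.0)
print(f"{len(hits)} results, exact distances computed for {checked}/{len(X)}")
```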

17.
Nearest neighbor (NN) search in high-dimensional space plays a fundamental role in large-scale image retrieval. It requires efficient indexing and search techniques, both of which are essential for similarity search and semantic analysis. In recent years, however, breakthroughs have been rare: the achievements of current techniques for NN search are far from satisfactory, especially for exact NN search. A recently proposed method, HB, addresses exact NN search efficiently in high-dimensional space. It benefits from cluster-based techniques, which can generate a more compact representation of the data set than other techniques by exploiting interdimensional correlations. However, HB suffers from the huge cost of lower-bound computations and provides no further pruning scheme for points in candidate clusters. In this paper, we extend the HB method to address exact NN search in correlated, high-dimensional vector data sets extracted from large-scale image databases by introducing two new pruning/selection techniques; we call the result HB+. The first technique selects more quickly the subset of hyperplanes/clusters that must be considered. The second prunes irrelevant points in the selected subset of clusters. Experiments show the improvement of HB+ over HB in terms of efficiency (I/O cost and CPU response time) and demonstrate its superiority over other exact NN indexes.

18.
Efficient high-dimensional indexing by sorting principal component
The vector approximation file (VA-file) is an efficient high-dimensional indexing method for image retrieval in large databases. Several extensions of the VA-file have been proposed to improve query performance; however, all of these methods apply a sequential scan and must read the whole vector approximation file. In this paper, we present a new indexing structure based on the vector approximation method in which only a small part of the approximation file needs to be accessed. First, principal component analysis is used to map multidimensional points onto a one-dimensional line. A B+-tree is then built to index the approximate vectors according to the principal component. When performing a k-nearest-neighbor search, the partial distortion searching algorithm is used to reject unsuitable approximate vectors, so only a small set of approximate vectors needs to be sequentially scanned, reducing both CPU and I/O costs dramatically. Experimental results on large image databases show that the new approach provides faster search than other VA-file approaches.
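A sketch of the sort-by-first-principal-component idea: project all points onto the leading principal component, keep them sorted (a stand-in for the B+-tree), and scan outward from the query's projection, stopping once the projection gap alone exceeds the current k-th best distance. Projection onto a unit vector never exceeds the true distance, so the stop is safe; the in-loop comment marks where partial distortion searching would abort individual comparisons early.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(10000, 32)) @ np.diag(np.linspace(3, 0.2, 32))

mu = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
pc1 = Vt[0]                                  # leading principal component
proj = (X - mu) @ pc1
order = np.argsort(proj)                     # sorted projections ~ B+-tree
proj_sorted = proj[order]

def knn(q, k=5):
    qp = (q - mu) @ pc1
    start = np.searchsorted(proj_sorted, qp)
    best = []                                 # (dist, id), kept sorted
    lo, hi = start - 1, start
    scanned = 0
    while lo >= 0 or hi < len(X):
        # Pick the next point by projection gap (expanding window).
        if lo < 0 or (hi < len(X) and proj_sorted[hi] - qp < qp - proj_sorted[lo]):
            i, hi = order[hi], hi + 1
            gap = abs(proj_sorted[hi - 1] - qp)
        else:
            i, lo = order[lo], lo - 1
            gap = abs(proj_sorted[lo + 1] - qp)
        # Projection is 1-Lipschitz, so the gap lower-bounds true distance:
        if len(best) == k and gap >= best[-1][0]:
            break
        d = np.linalg.norm(X[i] - q)          # partial distortion search would
        best = sorted(best + [(d, i)])[:k]    # abort this sum early instead
        scanned += 1
    return best, scanned

best, scanned = knn(rng.normal(size=32) @ np.diag(np.linspace(3, 0.2, 32)))
print(f"scanned {scanned}/{len(X)} vectors")
```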

19.
Sequential indexing methods are studied, and a high-dimensional sequential index structure based on vector approximation is proposed that can answer k-nearest-neighbor queries by sequentially accessing only part of the file. During a query, projection values are used to terminate the search, and distances are used to eliminate non-matching data. To further reduce the fraction of data accessed, an ellipsoidal clustering algorithm is used to partition the data set. The new index structure supports answering a k-NN query with several sequential access passes, lowering both the I/O and the CPU cost of query processing. Experiments on a large high-dimensional image feature database show that the new index structure outperforms other high-dimensional indexing methods.

20.
A principle for integrating neural network modules based on chaotic dynamics was studied in our two-module Nozawa model. Chaotic neural networks represent each embedded pattern as a low-dimensional periodic orbit, while the others appear as high-dimensional chaotic attractors; this is equivalent to W. Freeman's "I don't know" and "I know" states. In particular, we noted that the combination of two-way inputs to each neural network module conflicted with the embedded Hebbian correspondence. We found that the interaction between the modules generated a novel "I know" state in addition to the embedded representation. Chaotic neural network modules can thus autonomously generate novel memories or functions through this interaction. The result suggests functional integration in neural networks as it ought to be, e.g., feature binding and gestalt. This work was presented, in part, at the Fourth International Symposium on Artificial Life and Robotics, Oita, Japan, January 19–22, 1999.
