This paper concerns the following problem: given a set of multi-attribute records, a fixed number of buckets, and a two-disk system, arrange the records into the buckets and then distribute the buckets between the disks in such a way that, over all possible orthogonal range queries (ORQs), disk access concurrency is maximized. We adopt the multiple key hashing (MKH) method for arranging records into buckets and use the disk modulo (DM) allocation method for storing buckets onto disks. Since the DM allocation method has been shown to be superior to all other allocation methods for placing an MKH file on a two-disk system for answering ORQs, the real issue is determining an optimal way to organize the records into buckets based on the MKH concept.
A performance formula that can be used to evaluate the average response time, over all possible ORQs, of an MKH file in a two-disk system using the DM allocation method is first presented. Based upon this formula, it is shown that our design problem is related to a notoriously difficult problem, namely the Prime Number Problem. Then a performance lower bound and an efficient algorithm for designing optimal MKH files in certain cases are presented. It is pointed out that in some cases the optimal MKH file for ORQs in a two-disk system using the DM allocation method is identical to the optimal MKH file for ORQs in a single-disk system and the optimal average response time in a two-disk system is slightly greater than one half of that in a single-disk system.
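As a concrete illustration of the allocation rule (using generic notation, not the paper's exact formulation): the DM method assigns the bucket whose MKH address is (b1, ..., bk) to disk (b1 + ... + bk) mod m, where m is the number of disks. A minimal sketch:

```python
def dm_disk(bucket_address, num_disks=2):
    """Disk Modulo (DM) allocation: the bucket with MKH address
    (b1, ..., bk) is stored on disk (b1 + ... + bk) mod m."""
    return sum(bucket_address) % num_disks

# With m = 2, two bucket addresses differing by 1 in one coordinate
# always land on different disks, so a range query covering both
# buckets can read them concurrently.
print(dm_disk((0, 1, 2)))  # (0 + 1 + 2) % 2 = 1
print(dm_disk((0, 1, 3)))  # (0 + 1 + 3) % 2 = 0
```

The per-coordinate bucket counts chosen for the MKH hash functions determine how often adjacent buckets fall on distinct disks, which is why bucket organization, rather than allocation, is the remaining design question.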
Conventional image hash functions exploit only the luminance component of color images to generate robust hashes and thus have limited discriminative capacity. In this paper, we propose a robust image hash function for color images that takes all components of a color image into account and achieves good discrimination. Firstly, the proposed hash function re-scales the input image to a fixed size. Secondly, it extracts local color features by converting the RGB image into the HSI and YCbCr color spaces and computing the block mean and variance of each component of the HSI and YCbCr representations. Finally, it takes the Euclidean distances between the block features and a reference feature as hash values. Experiments are conducted to validate the effectiveness of our hash function. Receiver operating characteristic (ROC) curve comparisons with two existing algorithms demonstrate that our hash function outperforms them in the trade-off between perceptual robustness and discriminative capability.
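The block-feature step can be sketched for a single component as follows; the block size, the choice of global mean as the reference feature, and square channels are illustrative assumptions, not the paper's exact parameters:

```python
import math

def block_stats(channel, block):
    """Mean and variance of each non-overlapping block x block tile
    of a square 2-D channel (one HSI or YCbCr component)."""
    n = len(channel)
    feats = []
    for r in range(0, n, block):
        for c in range(0, n, block):
            vals = [channel[i][j] for i in range(r, r + block)
                                  for j in range(c, c + block)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            feats.append((mean, var))
    return feats

def color_hash(channels, block=2):
    """Illustrative sketch: gather (mean, variance) block features
    from every color component, take the global mean feature as the
    reference, and emit each block's Euclidean distance to it."""
    feats = [f for ch in channels for f in block_stats(ch, block)]
    ref = (sum(m for m, _ in feats) / len(feats),
           sum(v for _, v in feats) / len(feats))
    return [round(math.dist(f, ref), 4) for f in feats]
```

Because the distances are relative to a reference derived from the image itself, the resulting hash is stable under global luminance shifts while still reflecting local color structure.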
In many-task computing (MTC), applications such as scientific workflows or parameter sweeps communicate via intermediate files; application performance strongly depends on the file system in use. The state of the art uses runtime systems providing in-memory file storage that is designed for data locality: files are placed on those nodes that write or read them. With data locality, however, task distribution conflicts with data distribution, leading to application slowdown, and worse, to prohibitive storage imbalance. To overcome these limitations, we present MemFS, a fully symmetrical, in-memory runtime file system that stripes files across all compute nodes, based on a distributed hash function. Our cluster experiments with Montage and BLAST workflows, using up to 512 cores, show that MemFS has both better performance and better scalability than the state-of-the-art, locality-based file system, AMFS. Furthermore, our evaluation on a public commercial cloud validates our cluster results. On this platform MemFS shows excellent scalability up to 1024 cores and is able to saturate the 10G Ethernet bandwidth when running BLAST and Montage.
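The symmetric placement idea can be illustrated with a small sketch; the hash choice, stripe naming, and path are assumptions for illustration, not MemFS's actual implementation:

```python
import hashlib

def stripe_node(path, stripe_idx, nodes):
    """Symmetric striping sketch: stripe i of a file is placed on the
    node selected by a distributed hash of (path, stripe index), so
    every file's data is spread over all compute nodes instead of
    concentrating on the writer's node."""
    key = f"{path}:{stripe_idx}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return nodes[h % len(nodes)]

nodes = [f"node{i}" for i in range(8)]
placement = [stripe_node("/scratch/out.fits", i, nodes)
             for i in range(16)]
```

Any node can recompute a stripe's location from the hash alone, so reads need no metadata lookup on the writer, and storage load stays balanced regardless of which tasks produce which files.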
A better similarity index structure for high-dimensional feature points is highly desirable for building scalable content-based search systems over feature-rich datasets. In this paper, we introduce sparse principal component analysis (Sparse PCA) and Boosting Similarity Sensitive Coding (Boosting SSC) into traditional spectral hashing to obtain both effective and data-aware binary codes for real data. We call this Sparse Spectral Hashing (SSH). SSH formulates binary coding as thresholding a subset of eigenvectors of the graph Laplacian while constraining the number of nonzero features. Convex relaxation and eigenfunction learning are conducted in SSH to make the coding globally optimal and generalizable to points outside the training data. Comparisons in terms of F1 score and AUC show that SSH substantially outperforms other methods on both image and text datasets.
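The final coding step can be sketched as follows; this is a simplified illustration of sign-thresholding Laplacian eigenvectors, omitting the sparse-PCA constraint and out-of-sample eigenfunction learning that SSH adds:

```python
def spectral_codes(eigvecs):
    """Simplified sketch of spectral binary coding: each retained
    Laplacian eigenvector contributes one bit per data point by
    thresholding its entries at zero; point i's code is the column
    of bits collected across eigenvectors."""
    bits = [[1 if v > 0 else 0 for v in vec] for vec in eigvecs]
    n = len(eigvecs[0])
    return [tuple(row[i] for row in bits) for i in range(n)]

def hamming(a, b):
    """Search then compares codes by Hamming distance."""
    return sum(x != y for x, y in zip(a, b))

# Two toy eigenvectors over three points -> a 2-bit code per point.
codes = spectral_codes([[0.5, -0.2, 0.1], [-0.3, 0.4, -0.1]])
```

Constraining the eigenvectors to few nonzero features, as SSH does, ties each bit to a small, interpretable subset of input dimensions and makes coding new points cheap.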
The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function to severely constrain the number of points considered for each lookup. Tested on a large real-world audio dataset, only a tiny fraction of the points (~0.27%) are ever considered for each lookup. To increase efficiency, no comparisons in the original high-dimensional space of points are required. The performance far surpasses, in terms of both efficiency and accuracy, a state-of-the-art Locality-Sensitive-Hashing-based (LSH) technique for the same problem and data set.
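The lookup structure this enables can be sketched generically; the bucketing scheme and the toy threshold hash below are illustrative stand-ins, not the paper's learned function:

```python
from collections import defaultdict

def make_index(items, hash_fn):
    """Once a similarity-preserving hash function is learned, it
    buckets the corpus so each query inspects only the points that
    share its hash code."""
    index = defaultdict(list)
    for item_id, features in items:
        index[hash_fn(features)].append(item_id)
    return index

def lookup(index, query, hash_fn):
    # Candidates come from the query's bucket alone; no distance
    # computation in the original high-dimensional space is needed.
    return index.get(hash_fn(query), [])

# Toy stand-in for a learned hash: a sign pattern against fixed
# thresholds (a real system learns these from weakly labeled pairs).
def toy_hash(x):
    return tuple(1 if v > 0.5 else 0 for v in x)
```

Because candidate retrieval is a single hash-table probe, the fraction of the corpus ever touched per query stays tiny, matching the ~0.27% figure reported above.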