Similar Documents
1.

A trie structure is frequently used in various applications, such as natural language dictionaries, database systems, and compilers. However, the total number of trie nodes becomes large, and the structure requires a great deal of space for a huge set of keys. The space cost becomes a serious problem if long strings, or compound words, are stored in the trie. To resolve this disadvantage, this paper presents a compression method for these long strings that uses the trie arcs for single words. The idea of the compression scheme is to replace long strings with the corresponding leaf-node numbers of the trie. The double-array structure is introduced because this approach requires fast backward tracing of the trie. Theoretical and experimental observations show that the presented method is more practical than existing ones.
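The lookup discipline behind the double array can be shown compactly: the arc from state s labelled c leads to t = base[s] + c, and the arc is valid only when check[t] == s. Below is a minimal C sketch under that convention, assuming prebuilt base and check arrays of adequate size; the names are ours, not the paper's.

    #include <stddef.h>

    /* Minimal double-array trie traversal: a sketch, not the paper's code.
     * base[] and check[] are assumed prebuilt so that the arc from state s
     * labelled c leads to t = base[s] + c with check[t] == s.
     * A real implementation would also bound-check t. */
    int da_traverse(const int *base, const int *check,
                    int root, const unsigned char *key)
    {
        int s = root;
        for (size_t i = 0; key[i] != '\0'; i++) {
            int t = base[s] + key[i];      /* candidate target state */
            if (check[t] != s)
                return -1;                 /* arc absent: key not stored */
            s = t;
        }
        return s;   /* caller tests whether s is marked terminal */
    }

Backward tracing, which the paper needs, runs the same rule in reverse: from a node t, the parent is check[t] and the arc label is t - base[check[t]], which is why the double array suits this compression scheme.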

2.
A string dictionary is a basic tool for storing a set of strings in many kinds of applications. Recently, many applications have needed space-efficient dictionaries to handle very large datasets. In this paper, we propose new compressed string dictionaries based on improved double-array tries. The double-array trie is a data structure that supports extremely fast string lookup, but its space efficiency is low. We introduce approaches that mitigate this disadvantage. Experimental evaluations show that our dictionaries provide the fastest lookup among state-of-the-art compressed string dictionaries, while their space efficiency remains competitive in many cases.

3.
We propose a novel compression scheme to store neighbour lists for iterative solvers that employ Smoothed Particle Hydrodynamics (SPH). The compression scheme is inspired by Stream VByte, but uses a non-linear mapping from data to data bytes, yielding memory savings of up to 87%. It is part of a novel variant of the Cell-Linked-List (CLL) concept that is inspired by compact hashing with an improved processing of the cell-particle relations. We show that the resulting neighbour search outperforms compact hashing in terms of speed and memory consumption. Divergence-Free SPH (DFSPH) scenarios with up to 1.3 billion SPH particles can be processed on a 24-core PC using 172 GB of memory. Scenes with more than 7 billion SPH particles can be processed in a Message Passing Interface (MPI) environment with 112 cores and 880 GB of RAM. The neighbour search is also useful for interactive applications. A DFSPH simulation step for up to 0.2 million particles can be computed in less than 40 ms on a 12-core PC.
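For reference, the baseline that such schemes refine is classic variable-byte coding of delta-encoded lists: each integer is split into 7-bit groups, with the high bit marking continuation. The sketch below shows only this baseline idea; the paper's coder is a Stream VByte-like variant with a non-linear data-to-byte mapping, and all names here are ours.

    #include <stdint.h>
    #include <stddef.h>

    /* Encode one integer in 7-bit groups, high bit = "more follows".
     * Writes at most 5 bytes for a 32-bit value; returns bytes written. */
    size_t vbyte_encode(uint32_t v, uint8_t *out)
    {
        size_t n = 0;
        while (v >= 0x80) {
            out[n++] = (uint8_t)((v & 0x7F) | 0x80);
            v >>= 7;
        }
        out[n++] = (uint8_t)v;
        return n;
    }

    /* Neighbour lists compress well as deltas of sorted particle ids,
     * since spatially close particles tend to have close indices. */
    size_t encode_neighbours(const uint32_t *ids, size_t k, uint8_t *out)
    {
        size_t n = 0;
        uint32_t prev = 0;
        for (size_t i = 0; i < k; i++) {   /* ids assumed sorted ascending */
            n += vbyte_encode(ids[i] - prev, out + n);
            prev = ids[i];
        }
        return n;
    }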

4.
Speed and storage capacity are very important issues in information retrieval systems. In natural language analysis, the double array is a well-known data structure for implementing the trie, which is a widely used approach to retrieving strings from a dictionary; it offers the fast access of a matrix with the compactness of a list. To realize a very compact structure for information retrieval, this paper presents a compression method that divides the constructed trie into several pieces (pages), which reduces the number of bits needed to represent the entries of the double array. However, retrieval in the resulting trie must follow state connections across pages, which slows it down. To solve this problem, this paper also proposes a new trie construction method that minimizes the number of state connections. Experimental results for a large set of keys show that storage is reduced to 50% of the original, while the new approach retains the retrieval speed of the old one.
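The bit saving from paging is an addressing trick: within a page, an entry needs only a short page number plus an in-page offset rather than a full global index. The toy arithmetic below illustrates this; the page size, field widths, and names are our assumptions, not the paper's layout.

    #include <stdint.h>

    #define PAGE_BITS 12                     /* assume 4096 entries per page */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* A 16-bit page number plus an in-page offset in 4 bytes total,
     * where a global index might need a full machine word. */
    typedef struct {
        uint16_t page;
        uint16_t offset;                     /* only PAGE_BITS bits used */
    } packed_ref;

    static inline packed_ref pack(uint32_t global_index)
    {
        packed_ref r = {
            (uint16_t)(global_index >> PAGE_BITS),
            (uint16_t)(global_index & (PAGE_SIZE - 1u))
        };
        return r;
    }

    static inline uint32_t unpack(packed_ref r)
    {
        return ((uint32_t)r.page << PAGE_BITS) | r.offset;
    }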

5.
A scalable P2P platform for the knowledge grid
The knowledge grid needs a scalable platform to provide large-scale intelligent services. A key function of such a platform is efficient support for various complex queries in a dynamic, large-scale network environment. This paper proposes a platform that supports index-based path queries by coupling a semantic overlay with an underlying structured P2P network that provides object location and management services. Various distributed indexing structures can be formed dynamically by publishing semantic objects as indexing nodes. Queries are forwarded along chains of semantic object pointers to search for objects. We investigate the deployment of a scalable distributed trie index for broadcast queries on key strings, propose a decentralized load balancing method that addresses the uneven load distribution caused by heterogeneous loads and node capacities and by the distributed trie index, and give an approach for improving the availability of the semantic overlay and its trie index. Experiments demonstrate the scalability of the proposed platform.

6.
External sorting of large files of records involves disk space for temporary files, processing time for sorting, and transfer time between CPU, cache, memory, and disk. Compression can reduce disk and transfer costs and, in the case of external sorts, cut merge costs by reducing the number of runs. It is therefore plausible that the overall cost of external sorting could be reduced through compression. In this paper, we propose new compression techniques for data consisting of sets of records. The best of these techniques, based on building a trie of variable-length common strings, provides fast compression and decompression and allows random access to individual records. We show experimentally that our trie-based compression leads to a significant reduction in sorting costs; that is, it is faster to compress the data, sort it, and then decompress it than to sort the uncompressed data. While the degree of compression is not quite as great as can be obtained with adaptive techniques such as Lempel-Ziv methods, those techniques cannot be applied within sorting. Our experiments show that, in comparison to approaches such as Huffman coding of fixed-length substrings, our novel trie-based method is faster and provides greater size reductions. Preliminary versions of parts of this paper, not including the work on "vargram" compression, appeared previously [41].
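To make the compress-sort-decompress idea concrete, here is a toy record coder in the same spirit: occurrences of common strings are replaced by two-byte codes, and each record is coded independently so any record remains randomly accessible. This is a deliberate simplification, not the paper's trie coder; the dictionary, the escape byte, and all names are ours.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define ESC 0xFF                     /* escape byte introducing a code */

    static const char *dict[] = { "tion", "the ", "ing " };  /* toy dictionary */
    enum { NDICT = 3 };

    /* Compress one record; the longest dictionary match wins at each
     * position. Decoding rule: ESC then d < NDICT expands to dict[d];
     * ESC then ESC is a literal 0xFF byte. Returns compressed length. */
    size_t compress_record(const uint8_t *rec, size_t len, uint8_t *out)
    {
        size_t i = 0, n = 0;
        while (i < len) {
            int best = -1;
            size_t best_len = 0;
            for (int d = 0; d < NDICT; d++) {
                size_t dl = strlen(dict[d]);
                if (dl > best_len && i + dl <= len &&
                    memcmp(rec + i, dict[d], dl) == 0) {
                    best = d;
                    best_len = dl;
                }
            }
            if (best >= 0) {
                out[n++] = ESC;
                out[n++] = (uint8_t)best;    /* 2 bytes replace best_len */
                i += best_len;
            } else {
                if (rec[i] == ESC)
                    out[n++] = ESC;          /* escape a literal 0xFF */
                out[n++] = rec[i++];
            }
        }
        return n;
    }

Random access matters here because, during the merge phase of an external sort, individual records must be compared and moved without decompressing whole blocks.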

7.
The K-Nearest Neighbor (K-NN) search problem is to find the K objects closest, and thus most similar, to a given query. K-NN search is essential for many applications, such as information retrieval and visualization, machine learning, and data mining. The exponential growth of data makes approximate approaches to this problem necessary. Permutation-based indexing is one of the most recent techniques for approximate similarity search: objects are represented by permutation lists that order their distances to a set of selected reference objects, following the idea that two neighboring objects have similar surroundings. In this paper, we propose a novel quantized representation of permutation lists, with an associated data structure for effective retrieval on single-core and multicore architectures. Our permutation-based indexing strategy is built to be fast, memory-efficient, and scalable. This is demonstrated experimentally, in comparison with existing proposals, on several large-scale datasets of millions of documents and of different dimensions.
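Concretely, an object's permutation list is just the indices of the reference objects sorted by increasing distance from the object. A minimal sketch under the assumption of Euclidean distance follows; the paper goes further by quantizing these lists, and all names here are ours.

    #include <stdlib.h>

    struct ranked { double dist; int ref; };

    static int by_dist(const void *a, const void *b)
    {
        double da = ((const struct ranked *)a)->dist;
        double db = ((const struct ranked *)b)->dist;
        return (da > db) - (da < db);
    }

    /* Fill perm[0..n_refs-1] with reference indices ordered by increasing
     * Euclidean distance from the dim-dimensional object x. */
    void permutation_list(const double *x, const double *refs,
                          int n_refs, int dim, int *perm)
    {
        struct ranked *r = malloc((size_t)n_refs * sizeof *r);
        for (int i = 0; i < n_refs; i++) {
            double s = 0.0;
            for (int d = 0; d < dim; d++) {
                double diff = x[d] - refs[i * dim + d];
                s += diff * diff;
            }
            r[i].dist = s;        /* squared distance sorts identically */
            r[i].ref = i;
        }
        qsort(r, (size_t)n_refs, sizeof *r, by_dist);
        for (int i = 0; i < n_refs; i++)
            perm[i] = r[i].ref;
        free(r);
    }

Two objects are then judged similar if their permutation lists agree near the front, e.g. under the Spearman footrule distance, which avoids computing exact distances at query time.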

8.
Emerging non-volatile memory technologies provide a new choice for storing persistent data in memory, so file system structures need to be re-examined and redesigned. Our goal is to design a framework that delivers high-performance in-memory file access and allows a file's data to be stored across memory and block devices. This paper presents a novel unified framework for in-memory and hybrid-memory file systems based on the concept that each file has a contiguous "File Virtual Address Space". Within this framework, access to in-memory file data is handled efficiently by the address translation hardware and the file's virtual address space, while access to data on block devices is handled by a dedicated page-fault handler for the file system. A file system called Hybrid Memory File System (HMFS) is implemented on this framework. Experimental results show that the throughput of HMFS approaches the memory bus bandwidth in the best cases. Compared with in-memory file systems, HMFS is 5, 2.1, and 1.6 times faster than EXT4 on Ramdisk, RAMFS, and PMFS, respectively. Compared with EXT4 on SSD and EXT4 using the page cache, HMFS achieves improvements of 100 times and tens of times, respectively.

9.
In recent years, several authors have presented algorithms that locate instances of a given string, or set of strings, within a text. Less consideration has been given to the complementary problem of processing a text to find out what strings appear in it, without any preconceived notion of which strings might be present. PATRICIA, a system developed two decades ago, implements a solution to this problem, but its design is tightly bound to the assumptions that individual string elements are bits and that the user can provide complete lists of starting and stopping places for strings. This paper presents an approach that drops these assumptions. Our method allows different definitions of indivisible string elements for different applications, and the only information the user provides for determining the beginnings and ends of strings is a maximum length for output strings. The paper also describes a portable C implementation of the method, called PORTREP. The primary data structure of PORTREP is a trie represented as a ternary tree. PORTREP eliminates redundancy from its output and can operate with a bounded number of nodes by employing a heuristic that reuses seldom-visited nodes. Theoretical analysis and empirical studies, reported here, give confidence in the efficiency of the algorithms. PORTREP can form the basis for a variety of text-analysis applications; this paper considers one such application, automatic document indexing.
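A ternary-tree trie stores one character per node with three children (less-than, equal, greater-than), so it needs far fewer pointers per node than an array-of-children trie. The sketch below shows lookup in such a structure, in the style of Bentley and Sedgewick's ternary search trees; the field names are ours, not PORTREP's.

    /* One node of a ternary-tree trie. The terminator '\0' is stored in
     * the tree itself, so reaching it along the "equal" path means the
     * whole string is present. */
    struct tnode {
        char split;                   /* character stored at this node */
        struct tnode *lo, *eq, *hi;   /* <, ==, > subtrees */
    };

    int tst_contains(const struct tnode *t, const char *s)
    {
        while (t) {
            if (*s < t->split)
                t = t->lo;
            else if (*s > t->split)
                t = t->hi;
            else {
                if (*s++ == '\0')
                    return 1;         /* matched through the terminator */
                t = t->eq;
            }
        }
        return 0;
    }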

10.
Our research shows that for large databases, cluster-based retrieval (CBR) can compete with the time efficiency and effectiveness of inverted-index-based full search (FS) without considerable additional storage overhead. The proposed CBR method employs a storage structure that blends cluster membership information into the inverted file posting lists. This approach significantly reduces the cost of similarity calculations for document ranking during query processing and improves efficiency; for example, in terms of in-memory computations, it can reduce query processing time to 39% of that of FS. The experiments confirm that the approach is scalable and that system performance improves with increasing database size. The experiments use the cover-coefficient-based clustering methodology (C3M) and the Financial Times database of TREC, which contains 210,158 documents (564 MB) defined by 229,748 terms, with a total of 29,545,234 inverted index elements. This study provides CBR efficiency and effectiveness experiments using the largest corpus in an environment that employs no user interaction or user behavior assumptions for clustering.
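The storage idea can be pictured as postings that carry cluster membership, so one pass over a term's inverted list both ranks documents and skips those outside the clusters selected for the query. A schematic C sketch follows; the field names and layout are ours, not the paper's exact structure.

    struct posting {
        int   doc_id;
        int   cluster_id;   /* membership blended into the posting list */
        float weight;       /* term weight used for ranking */
    };

    /* Accumulate similarity scores only for documents whose cluster was
     * selected; in_selected[c] is nonzero if cluster c is chosen. */
    void score_postings(const struct posting *plist, int n,
                        const unsigned char *in_selected,
                        float query_weight,
                        float *acc /* indexed by doc_id */)
    {
        for (int i = 0; i < n; i++) {
            if (in_selected[plist[i].cluster_id])
                acc[plist[i].doc_id] += query_weight * plist[i].weight;
        }
    }

In this sketch, the saving in similarity computations comes from reading the list once while most entries cost only the cluster test.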

11.
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can benefit significantly from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) they are inefficient for datasets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to maintain under dynamic updates of the datasets. To address these problems, we propose a novel method called trie-join, which generates results efficiently with small indexes. We index the strings in a trie and use the trie structure to find similar string pairs efficiently based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance, and our method extends easily to support dynamic updates of the datasets. Extensive experiments on four real datasets show that our algorithms outperform state-of-the-art methods by an order of magnitude on datasets with short strings.
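Subtrie pruning can be illustrated on its own: descend the trie while maintaining one edit-distance DP row per node, and abandon a subtrie as soon as every cell of its row exceeds the threshold, since the row minimum never decreases as you descend. The sketch below assumes a simple pointer-based trie over 'a'..'z'; the names are ours, and the paper's trie-join performs a more elaborate dual-trie traversal.

    #include <stdio.h>
    #include <string.h>

    #define ALPHA 26                  /* toy alphabet: 'a'..'z' */
    #define MAXLEN 64                 /* maximum query length handled */

    struct trie {
        struct trie *child[ALPHA];
        const char *word;             /* non-NULL at word-terminal nodes */
    };

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    /* Extend the DP row prev[] by trie arc c; report and recurse. */
    static void search(const struct trie *t, char c, const int *prev,
                       const char *q, int qlen, int k)
    {
        int row[MAXLEN + 1];
        row[0] = prev[0] + 1;
        int best = row[0];
        for (int j = 1; j <= qlen; j++) {
            int sub = prev[j - 1] + (q[j - 1] != c);
            row[j] = min3(row[j - 1] + 1, prev[j] + 1, sub);
            if (row[j] < best) best = row[j];
        }
        if (t->word && row[qlen] <= k)
            printf("match: %s (distance %d)\n", t->word, row[qlen]);
        if (best > k)
            return;                   /* prune: whole subtrie is too far */
        for (int a = 0; a < ALPHA; a++)
            if (t->child[a])
                search(t->child[a], (char)('a' + a), row, q, qlen, k);
    }

    /* Report every trie word within edit distance k of q (|q| <= MAXLEN). */
    void trie_edit_search(const struct trie *root, const char *q, int k)
    {
        int row0[MAXLEN + 1];
        int qlen = (int)strlen(q);
        for (int j = 0; j <= qlen; j++)
            row0[j] = j;              /* distance from the empty prefix */
        if (root->word && qlen <= k)
            printf("match: %s (distance %d)\n", root->word, qlen);
        for (int a = 0; a < ALPHA; a++)
            if (root->child[a])
                search(root->child[a], (char)('a' + a), row0, q, qlen, k);
    }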

12.
Computer Communications, 1999, 22(15-16): 1415-1422
The success of next-generation IP networks in providing good services relies on the deployment of high-performance routers that perform fast IP routing lookups. In this paper, we propose a new algorithm for fast IP lookups using a so-called two-trie structure. The two-trie structure requires less memory for representing a routing table than the standard trie while still providing fast lookups. In our simulations, it saves around 27% of memory compared with the standard trie, while a lookup takes 1.6 memory accesses on average and 8 memory accesses in the worst case. The structure also makes no assumptions about the distribution of prefix lengths in routing tables, so increasing the prefix length from 32 to 128 bits (from IPv4 to IPv6) does not affect the main structure.
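For comparison, longest-prefix matching in a standard one-trie works as below: remember the deepest node where a prefix ends while descending bit by bit. The paper's two-trie layout differs in how prefixes are stored, but the lookup discipline is the same record-the-best-match idea; the names here are ours.

    #include <stdint.h>

    struct bnode {
        struct bnode *child[2];   /* 0-bit and 1-bit branches */
        int next_hop;             /* -1 if no prefix ends at this node */
    };

    /* Return the next hop of the longest matching prefix for addr,
     * or -1 if nothing matches (no default route). */
    int lpm_lookup(const struct bnode *root, uint32_t addr)
    {
        int best = -1;
        const struct bnode *n = root;
        for (int bit = 31; n != NULL; bit--) {
            if (n->next_hop >= 0)
                best = n->next_hop;    /* deepest match seen so far */
            if (bit < 0)
                break;                 /* all 32 bits consumed */
            n = n->child[(addr >> bit) & 1u];
        }
        return best;
    }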

13.
This paper presents the scalable on-line execution (SOLE) algorithm for continuous, on-line evaluation of concurrent continuous spatio-temporal queries over data streams. Incoming spatio-temporal data streams are processed in memory against a set of outstanding continuous queries. SOLE uses the scarce memory resource efficiently by keeping track of only the significant objects; in-memory objects are expired (i.e., dropped) once they become insignificant. SOLE is scalable in that all outstanding continuous queries share the same buffer pool. In addition, SOLE is formulated as a spatio-temporal join between two input streams: a stream of spatio-temporal objects and a stream of spatio-temporal queries. To cope with intervals of high arrival rates of objects and/or queries, SOLE uses a load-shedding approach in which some stored objects are dropped from memory. SOLE is implemented as a pipelined query operator that can be combined with traditional query operators in a query execution plan to support a wide variety of continuous queries. Performance experiments based on a real implementation of SOLE inside a prototype data stream management system show its scalability and efficiency in highly dynamic environments.

14.
A double array is a well-known data structure for implementing the trie. However, the space efficiency of the double array degrades with the number of key deletions, because the double array keeps the empty elements produced by each deletion. This paper presents a fast and compact method for eliminating empty elements that exploits properties of trie nodes that have no siblings. The method is implemented in C. Simulation results for large sets of keys show that it is about 30-330 times faster than the conventional elimination method while maintaining high space efficiency.

15.
A minimal-prefix (MP) double array is an efficient data structure for a trie: it requires only a small amount of space and enables fast retrieval. However, its space efficiency is degraded by deletions. This paper presents a fast and compact adaptive deletion method for the MP double array, implemented in C. Simulation results for English and Japanese keys show that the adaptive method is faster than the conventional method and maintains higher space efficiency.

16.
The Trie Data Structure and Its Applications
Many computer applications involve string processing, and designing a good data structure is essential for processing efficiency. This paper briefly analyzes several common string data structures and their performance, focuses on the structural characteristics of three forms of the trie, and finally illustrates the trie's practical use through its application to IP address lookup.

17.
A heuristic algorithm for testing the absolute irreducibility of multivariate polynomials over arbitrary fields using Newton polytopes was proposed by Gao and Lauder (Discrete Comput. Geom. 26:89-104, 2001). A preliminary implementation by Gao and Lauder (2003) established a wide range of families of low-degree, sparse polynomials for which the algorithm works efficiently and with a high success rate. In this paper, we develop a BSP variant of absolute irreducibility testing via polytopes, aiming at a more memory- and run-time-efficient method with a wider range of applicability, specifically in terms of the degrees of the input polynomials. In the bivariate case, we describe a balanced load scheme and a corresponding data distribution that lead to a parallel algorithm whose efficiency can be established under reasonably realistic conditions. This is later incorporated into a doubly parallel algorithm for the multivariate case that achieves similarly scalable performance. Both parallel models are analyzed for efficiency, and the theoretical analysis is compared with our experimental results. Empirically, we achieve absolute irreducibility testing for bivariate and trivariate polynomials of degrees up to 30,000, and for low-degree multivariate polynomials with more than 3,000 variables. To the best of our knowledge, this sets a world record in establishing the absolute irreducibility of multivariate polynomials.

18.
We have witnessed exciting developments in RAM technology over the past decade: memory sizes grow rapidly while prices continue to fall, making it feasible to deploy large amounts of RAM in a computer system. Several companies and research institutions have devoted substantial resources to developing in-memory databases (IMDBs), which execute queries after loading data into (virtual) memory in advance. The proliferation of in-memory databases calls for testing and evaluating their performance objectively and fairly. Although existing database benchmarks such as the Wisconsin benchmark and the TPC-X series have achieved great success, they are not suitable for in-memory databases because they do not consider an IMDB's unique characteristics. In this study, we propose MemTest, a novel benchmark that addresses the major characteristics of an in-memory database. The benchmark defines metrics covering the processing time, compression ratio, minimal memory space, and column strength of an in-memory database. We design a data model based on inter-bank transaction applications, along with a data generator that supports uniform and skewed data distributions. The MemTest workload comprises a set of queries and transactions against these metrics and the data model. Finally, we illustrate the efficacy of MemTest through implementations on two different in-memory databases.

19.
Concept learning provides a natural framework in which to place the problems solved by the quantum algorithms of Bernstein-Vazirani and Grover. By combining the tools used in these algorithms (quantum fast transforms and amplitude amplification) with a tool novel in this context (a solution method for geometrical optimization problems), we derive a general technique for quantum concept learning. We name this technique "Amplified Impatient Learning" and apply it to construct quantum algorithms that solve two new problems, Battleship and Majority, more efficiently than is possible classically.

20.
Finite-state automata are important components of information retrieval and natural language processing software. A recursive automaton is the most compact representation of an acyclic deterministic finite-state automaton; it merges not only equivalent states but also identical substructures of the automaton. The LZ trie variant is the state of the art in automata compression with respect to space, but the time needed for its construction was, until now, quadratic, making it impractical for large inputs. In this paper, we present the first algorithm for LZ trie construction that runs in effectively linear time, thereby making it an attractive choice for implementing finite-state automata. We achieve this by adding new functionality to the enhanced suffix array data structure. We present two variants of the construction procedure: a method that is optimal with respect to final size, and a method that sacrifices some compression for low intermediate memory usage. We have made the implementation of our algorithms available in the open source software package LzLex.
