首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Suffix tree, suffix array, and directed acyclic word graph (DAWG) are data-structures for indexing a text. Although they enable efficient pattern matching, their data-structures require O(nlogn) bits, which make them impractical to index long text like human genome. Recently, the development of compressed data-structures allow us to simulate suffix tree and suffix array using O(n) bits. However, there is still no O(n)-bit data-structure for DAWG with full functionality. This work introduces an $n(H_{k}(\overline{S})+ 2 H_{0}^{*}(\mathcal {T}_{\overline{S}}))+o(n)$ -bit compressed data-structure for simulating DAWG (where $H_{k}(\overline{S})$ and $H_{0}^{*}(\mathcal{T}_{\overline{S}})$ are the empirical entropies of the reversed sequence and the reversed suffix tree topology, respectively.) Besides, we also propose an application of DAWG to improve the time complexity for the local alignment problem. In this application, the previously proposed solutions using BWT (a version of compressed suffix array) run in O(n 2 m) worst case time and O(n 0.628 m) average case time where n and m are the lengths of the database and the query, respectively. Using compressed DAWG proposed in this paper, the problem can be solved in O(nm) worst case time and the same average case time.  相似文献   

2.
3.
Gonzalo Navarro  Jorma Tarhio 《Software》2005,35(12):1107-1130
We present a Boyer–Moore (BM) approach to string matching over LZ78 and LZW compressed text. The idea is to search the text directly in compressed form instead of decompressing and then searching it. We modify the BM approach so as to skip text using the characters explicitly represented in the LZ78/LZW formats, modifying the basic technique where the algorithm can choose which characters to inspect. We present and compare several solutions for single and multipattern searches. We show that our algorithms obtain speedups of up to 50% compared to the simple decompress‐then‐search approach. Finally, we present a public tool, LZgrep, which uses our algorithms to offer grep‐like capabilities directly searching files compressed using Unix's Compress, a LZW compressor. LZgrep can also search files compressed with Unix gzip, using new decompress‐then‐search techniques we develop, which are faster than the current tools. This way, users can always keep their files in compressed form and still search them, uncompressing only when they want to see them. Copyright © 2005 John Wiley & Sons, Ltd.  相似文献   

4.
Given a binary string of length n, we give a representation of its suffix array that takes O(nt(lgn)1/t) bits of space such that given i,1?i?n, the ith entry in the suffix array of the string can be retrieved in O(t) time, for any parameter 1?t?lglgn. For t=lglgn, this gives a compressed suffix array representation of Grossi and Vitter [Proc. Symp. on Theory Comput., 2000, pp. 397-406]. For t=O(1/ε), this gives the best known (in terms of space) compressed suffix array representation with constant query time. From this representation one can construct a suffix tree structure for a text of length n, that uses o(nlgn) bits of space which can be used to find all the k occurrences of a given pattern of length m in O(m/lgn+k) time. No such structure was known earlier.  相似文献   

5.
The representation of large subsets of the World Wide Web in the form of a directed graph has been extensively used to analyze structure, behavior, and evolution of those so-called Web graphs. However, interesting Web graphs are very large and their classical representations do not fit into the main memory of typical computers, whereas the required graph algorithms perform inefficiently on secondary memory. Compressed graph representations drastically reduce their space requirements while allowing their efficient navigation in compressed form. While the most basic navigation operation is to retrieve the successors of a node, several important Web graph algorithms require support for extended queries, such as finding the predecessors of a node, checking the presence of a link, or retrieving links between ranges of nodes. Those are seldom supported by compressed graph representations.This paper presents the k2-tree, a novel Web graph representation based on a compact tree structure that takes advantage of large empty areas of the adjacency matrix of the graph. The representation not only retrieves successors and predecessors in symmetric fashion, but also it is particularly efficient to check for specific links between nodes, or between ranges of nodes, or to list the links between ranges. Compared to the best representations in the literature supporting successor and predecessor queries, our technique offers the least space usage (1–3 bits per link) while supporting fast navigation to predecessors and successors (28μs per neighbor retrieved) and sharply outperforming the others on the extended queries. The representation is also of general interest and can be used to compress other kinds of graphs and data structures.  相似文献   

6.
7.
The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.  相似文献   

8.
9.
We present a four-stage algorithm that updates the Burrows–Wheeler Transform of a text TT, when this text is modified. The Burrows–Wheeler Transform is used by many text compression applications and some self-index data structures. It operates by reordering the letters of a text TT to obtain a new text bwt(T)bwt(T) which can be better compressed.  相似文献   

10.

In this paper, an approach has been made to produce a compressed audio without losing any information. The proposed scheme is fabricated with the help of dynamic cluster quantization followed by Burrows Wheeler Transform (BWT) and Huffman coding. The encoding algorithm has been designed in two phases, i.e., dynamic cluster selection (of sampled audio) followed by dynamic bit selection for determining quantization level of individual cluster. Quantization level of each cluster is selected dynamically based on mean square quantization error (MSQE). Bit stream is further compressed by applying Burrows Wheeler Transform (BWT) and Huffman code respectively. Experimental results are supported with current state-of-the-art in audio quality analysis (like statistical parameters (compression ratio, space savings, SNR, PSNR) along with other parameters (encoding time, decoding time, Mean Opinion Score (MOS) and entropy) and compared with other existing techniques.

  相似文献   

11.
A recent trend in stringology explores the possibility of utilizing text compression to speed up similarity computation between strings. In this line of investigation, run-length encoding is one of the earliest studied compression schemes. Despite its simple coding nature, the only positive result before this work is the computation of the in-del distance (dual of longest common subsequence), which requires O(mnlogmn) time, where m and n denote the number of runs of the input strings. The worst-case time complexity of computing the edit distance between two run-length encoded strings still depends on the uncompressed string lengths. In this paper, we break the foundational gap by providing its first “fully compressed” algorithm whose running time depends solely on the compressed string lengths. Specifically, given two strings, compressed into m and n runs, mn, we present an O(mn 2)-time algorithm for computing the edit distance of the strings. Our approach also yields the first fully compressed solution to approximate matching of a pattern of m runs in a text of n runs in O(mn 2) time.  相似文献   

12.
We propose a compact, dimension-independent data structure for manifold, non-manifold and non-regular simplicial complexes, that we call the Generalized Indexed Data Structure with Adjacencies (IA?data structure). It encodes only top simplices, i.e. the ones that are not on the boundary of any other simplex, plus a suitable subset of the adjacency relations. We describe the IA? data structure in arbitrary dimensions, and compare the storage requirements of its 2D and 3D instances with both dimension-specific and dimension-independent representations. We show that the IA? data structure is more cost effective than other dimension-independent representations and is even slightly more compact than the existing dimension-specific ones. We present efficient algorithms for navigating a simplicial complex described as an IA? data structure. This shows that the IA? data structure allows retrieving all topological relations of a given simplex by considering only its local neighborhood and thus it is a more efficient alternative to incidence-based representations when information does not need to be encoded for boundary simplices.  相似文献   

13.
Y. Nekrich 《Algorithmica》2007,49(2):94-108
In this paper we present new space efficient dynamic data structures for orthogonal range reporting. The described data structures support planar range reporting queries in time O(log n+klog log (4n/(k+1))) and space O(nlog log n), or in time O(log n+k) and space O(nlog  ε n) for any ε>0. Both data structures can be constructed in O(nlog n) time and support insert and delete operations in amortized time O(log 2 n) and O(log nlog log n) respectively. These results match the corresponding upper space bounds of Chazelle (SIAM J. Comput. 17, 427–462, 1988) for the static case. We also present a dynamic data structure for d-dimensional range reporting with search time O(log  d−1 n+k), update time O(log  d n), and space O(nlog  d−2+ε n) for any ε>0. The model of computation used in our paper is a unit cost RAM with word size log n. A preliminary version of this paper appeared in the Proceedings of the 21st Annual ACM Symposium on Computational Geometry 2005. Work partially supported by IST grant 14036 (RAND-APX).  相似文献   

14.
The Burrows–Wheeler Transform (BWT ) produces a permutation of a string X, denoted X?, by sorting the n cyclic rotations of X into full lexicographical order and taking the last column of the resulting n×n matrix to be X?. The transformation is reversible in time. In this paper, we consider an alteration to the process, called k‐BWT , where rotations are only sorted to a depth k. We propose new approaches to the forward and reverse transform, and show that the methods are efficient in practice. More than a decade ago, two algorithms were independently discovered for reversing k‐BWT , both of which run in time. Two recent algorithms have lowered the bounds for the reverse transformation to and, respectively. We examine the practical performance for these reversal algorithms. We find that the original approach is most efficient in practice, and investigates new approaches, aimed at further speeding reversal, which store precomputed context boundaries in the compressed file. By explicitly encoding the context boundaries, we present an reversal technique that is both efficient and effective. Finally, our study elucidates an inherently cache‐friendly – and hitherto unobserved – behavior in the reverse k‐BWT , which could lead to new applications of the k‐BWT transform. In contrast to previous empirical studies, we show that the partial transform can be reversed significantly faster than the full transform, without significantly affecting compression effectiveness. Copyright © 2011 John Wiley & Sons, Ltd.  相似文献   

15.
Given a collection of points in the plane, pick an arbitrary horizontal segment and move it vertically until it hits one of the points (if at all). This form ofsegment-dragging is a common operation in computer graphics and motion-planning, it can also serve as a building block for multidimensional data structures. This note describes a new approach to segment-dragging which yields a simple and efficient solution. The data structure requiresO(n) storage andO(n logn) preprocessing time, and each query can be answered inO(logn) time, wheren is the number of points in the collection. The method is best understood as the end result of a sequence of transformations applied to a simple but inefficient starting solution.This work was started while the author was a visiting professor at Ecole Normale Supérieure, Paris, France.  相似文献   

16.
The most natural kinetic data structure for maintaining the maximum of a collection of continuously changing numbers is the kinetic heap. Basch, Guibas, and Ramkumar proved that the maximum number of events processed by a kinetic heap with n numbers changing as linear functions of time is O(nlog2n) and . We prove that this number is actually Θ(nlogn). In the kinetic heap, a linear number of events are stored in a priority queue, consequently, it takes O(logn) time to determine the next event at each iteration. We also present a modified version of the kinetic heap that processes O(nlogn/loglogn) events, with the same O(logn) time complexity to determine the next event.  相似文献   

17.
Let S be a finite set of n points in d-dimensional space. S is α(n)-partitionable if there exists a set of d mutually intersecting hyperplanes that divide d-space into 2d open regions, no 2d ? 1 of which together contain more than α(n) points of S. Willard (1982) has shown that every set in 2-space is 34n-partitionable. Yao (1983) has shown that every set in 3-space is 2324n-partitionable. It is shown here that there exist sets S of arbitrary cardinality in d-space, d ? 5, for which d2 + 1 open regions together contain at least n ? d2 points of S, in any partition by d intersecting hyperplanes. Further, at least 2d ? d2 ? 1 open regions contain no points of S. This implies that the powerful balanced quad and oct-trees introduced by the above authors may not be generalized to balanced 2d-trees in dimension at least 5.  相似文献   

18.
We develop a cache-oblivious data structure for storing a set S of N axis-aligned rectangles in the plane, such that all rectangles in S intersecting a query rectangle or point can be found efficiently. Our structure is an axis-aligned bounding-box hierarchy and as such it is the first cache-oblivious R-tree with provable performance guarantees. If no point in the plane is contained in more than a constant number of rectangles in S, we can construct, for any constant ε, a structure that answers a rectangle query using \(O(\sqrt{N/B}+T/B)\) memory transfers and a point query using O((N/B) ε ) memory transfers, where T is the number of reported rectangles and B is the block size of memory transfers between any two levels of a multilevel memory hierarchy. We also develop a variant of our structure that achieves the same performance on input sets with arbitrary overlap among the rectangles. The rectangle query bound matches the bound of the best known linear-space cache-aware structure.  相似文献   

19.
20.
The dictionary matching problem seeks all locations in a given text that match any of the patterns in a given dictionary. Efficient algorithms for dictionary matching scan the text once, searching for all patterns simultaneously. Existing algorithms that solve the 2-dimensional dictionary matching problem all require working space proportional to the size of the dictionary. This paper presents the first efficient 2-dimensional dictionary matching algorithm that operates in small space. Given d patterns, D={P 1,…,P d }, each of size m×m, and a text T of size n×n, our algorithm finds all occurrences of P i , 1≤id, in T. The preprocessing of the dictionary forms a compressed self-index of the patterns, after which the original dictionary may be discarded. Our algorithm uses O(dmlogdm) extra bits of space. The time complexity of our algorithm is close to linear, O(dm 2+n 2 τlogσ), where τ is the time it takes to access a character in the compressed self-index and σ is the size of the alphabet. Using recent results τ is at most sub-logarithmic.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号