Similar Documents
Found 20 similar documents (search time: 151 ms)
1.
We present a new approach based on anagram hashing to handle lexical variation globally in large and noisy text collections. The lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system for retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the given string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible character confusions at a given edit distance, we sequentially identify all the pairs of text strings in the collection that display that confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character-confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus-word-based counterpart and evaluated on 6 years' worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed over its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper 'Het Volk' show that the system is not sensitive to domain variation.
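A minimal sketch of the anagram-hashing idea may help: each word gets a character-order-independent numeric key, so a given character confusion corresponds to a fixed difference between keys, and all word pairs exhibiting that confusion can be collected in one pass over the key table. The key function, the toy vocabulary, and the single-substitution restriction below are illustrative assumptions, not TICCL's actual implementation.

```python
from collections import Counter, defaultdict

POWER = 5  # exponent is an assumption; any order-independent key works

def anagram_key(word):
    # Anagrams share a key, so the key is insensitive to character order.
    return sum(ord(c) ** POWER for c in word)

def confusion_pairs(vocab):
    """For every single-character confusion a -> b, collect all word pairs
    in `vocab` whose anagram keys differ by key(b) - key(a)."""
    by_key = defaultdict(list)
    for w in vocab:
        by_key[anagram_key(w)].append(w)
    alphabet = {c for w in vocab for c in w}
    pairs = defaultdict(list)
    for a in alphabet:
        for b in alphabet - {a}:
            delta = ord(b) ** POWER - ord(a) ** POWER
            for key, words in by_key.items():
                for w in words:
                    for other in by_key.get(key + delta, []):
                        # Verify the anagram-level match: one a swapped for one b.
                        if Counter(w) - Counter(a) + Counter(b) == Counter(other):
                            pairs[(a, b)].append((w, other))
    return dict(pairs)

# Hypothetical OCR confusion e -> c: "the" misrecognized as "thc".
print(confusion_pairs({"the", "thc", "other", "othcr"}))
```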

2.
In this paper we address the problem of defining a measure of diversity for a population of individuals whose genome can be subjected to major reorganizations during the evolutionary process. To this end, we introduce a measure of diversity for populations of strings of variable length defined on a finite alphabet, and from this measure we derive a semi-metric distance between pairs of strings. The definitions are based on counting the number of substrings of the strings, considered first separately and then collectively. This approach is related to the concept of linguistic complexity, whose definition we generalize from single strings to populations. Using the substring-count approach we also define a new kind of Tanimoto distance between strings. We show how to extend the approach to representations that are not based on strings and, in particular, to the tree-based representations used in the field of genetic programming. We describe how suffix trees allow these measures and distances to be implemented with a computational cost that is linear in both space and time relative to the length of the strings and the size of the population. The definitions were devised to assess the diversity of populations having genomes of variable length and variable structure during evolutionary computation runs, but applications in quantitative genomics, proteomics, and pattern recognition can also be envisaged.
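As a rough illustration of the substring-count approach, the sketch below computes a Tanimoto-style distance from the sets of distinct substrings of two strings and lifts it to a population by averaging pairwise distances. The naive O(n²) substring enumeration and the averaging step are simplifying assumptions: the paper achieves linear bounds with suffix trees and defines the population measure collectively rather than pairwise.

```python
def substrings(s):
    """All distinct non-empty substrings of s (naive enumeration;
    the paper obtains linear cost via suffix trees)."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def tanimoto_distance(x, y):
    """Substring-based Tanimoto distance: 1 - |common| / |union|."""
    sx, sy = substrings(x), substrings(y)
    union = sx | sy
    if not union:
        return 0.0
    return 1.0 - len(sx & sy) / len(union)

def population_diversity(pop):
    """One simple lift to populations: mean pairwise distance."""
    pairs = [(a, b) for i, a in enumerate(pop) for b in pop[i + 1:]]
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

print(tanimoto_distance("abab", "abba"))          # ~0.636
print(population_diversity(["abab", "abba", "aaaa"]))
```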

3.

Graphs are commonly used to represent relationships in data; when the data are uncertain, we obtain probabilistic graphs. Clustering, a fundamental problem on such graphs, has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate the ensemble of clusterings, we derive a set of probable possible worlds of the initial probabilistic graph. We then present a probabilistic co-association matrix as a consensus function to integrate the base clustering results; it relies on co-occurrences of node pairs, weighted by the probability of the corresponding common cluster graphs. We also apply two improvements, one before and one after ensemble generation. Before generation, we augment the initial graph with neighborhood information based on node features, to obtain a more accurate estimate of the probability between nodes. After generation, we use a supervised, metric-learning-based Mahalanobis distance to automatically learn a metric from the ensemble clusters, aiming to capture the crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics: the Dunn index, the Davies-Bouldin index, and the Silhouette coefficient. The results show the impressive performance of clustering large probabilistic graphs.
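A hedged sketch of the ensemble step: sample possible worlds by keeping each edge with its probability, cluster each world, and record how often each node pair lands in the same cluster. Connected components stand in for the paper's base clustering algorithm, and the node and edge data are invented for illustration.

```python
import random
from itertools import combinations

def sample_world(prob_edges, rng):
    """Materialize one possible world: keep each edge with its probability."""
    return [(u, v) for (u, v), p in prob_edges.items() if rng.random() < p]

def connected_components(nodes, edges):
    # Union-find; a stand-in for the paper's base clustering algorithm.
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return {n: find(n) for n in nodes}

def co_association(nodes, prob_edges, n_worlds=200, seed=0):
    """Fraction of sampled worlds in which each node pair is co-clustered."""
    rng = random.Random(seed)
    counts = {pair: 0 for pair in combinations(sorted(nodes), 2)}
    for _ in range(n_worlds):
        labels = connected_components(nodes, sample_world(prob_edges, rng))
        for u, v in counts:
            counts[(u, v)] += labels[u] == labels[v]
    return {pair: c / n_worlds for pair, c in counts.items()}

nodes = {"a", "b", "c"}
prob_edges = {("a", "b"): 0.9, ("b", "c"): 0.3}
print(co_association(nodes, prob_edges))
```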


4.
Many pattern recognition algorithms are based on nearest-neighbour search and use the well-known edit distance, for which the primitive edit costs are usually fixed in advance. In this article, we aim to learn an unbiased stochastic edit distance, in the form of a finite-state transducer, from a corpus of (input, output) pairs of strings. Contrary to other standard methods, which generally use the Expectation-Maximisation algorithm, our algorithm learns a transducer independently of the marginal probability distribution of the input strings. Such an unbiased approach requires optimising the parameters of a conditional transducer instead of a joint one. We apply our new model in the context of handwritten digit recognition and show, through a large series of experiments, that it always outperforms the standard edit distance.
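For orientation, the sketch below shows the dynamic program that a learned edit distance plugs into. In the stochastic setting, the per-operation costs would be negative log-probabilities produced by the learned conditional transducer; the constant-cost lambdas here are a simplifying assumption.

```python
def weighted_edit_distance(x, y, sub_cost, ins_cost, del_cost):
    """Standard edit-distance DP with pluggable per-operation costs.
    With a learned stochastic model, these costs would be -log of the
    learned edit probabilities rather than fixed constants."""
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(x[i - 1]),
                d[i][j - 1] + ins_cost(y[j - 1]),
                d[i - 1][j - 1] + (0.0 if x[i - 1] == y[j - 1]
                                   else sub_cost(x[i - 1], y[j - 1])),
            )
    return d[m][n]

print(weighted_edit_distance("kitten", "sitting",
                             sub_cost=lambda a, b: 1.0,
                             ins_cost=lambda c: 1.0,
                             del_cost=lambda c: 1.0))  # 3.0
```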

5.
In this paper we introduce a biologically inspired distributed computing model called networks of evolutionary processors with parallel string rewriting rules (NEPPS), a variation of the hybrid networks of evolutionary processors introduced by Martin-Vide et al. Such a network contains simple processors located at the nodes of a virtual graph. Each processor holds strings (each string having multiple copies) and string rewriting rules, and the rules are applied in parallel to the strings. After the strings have been rewritten, they are communicated among the processors through filters. We show that NEPPS can theoretically break DES (the Data Encryption Standard), the most widely used cryptosystem: we prove that, given an arbitrary <plain-text, cipher-text> pair, one can recover the DES key in a constant number of steps.

6.
During the past few years, several works have derived string kernels from probability distributions. For instance, the Fisher kernel uses a generative model M (e.g. a hidden Markov model) and compares two strings according to how they are generated by M. The marginalized kernels, on the other hand, compute the joint similarity between two instances by summing conditional probabilities. In this paper, we adapt this approach to edit-distance-based conditional distributions and present a way to learn a new string edit kernel. We show that the practical computation of such a kernel between two strings x and x′ built from an alphabet Σ requires (i) learning edit probabilities in the form of the parameters of a stochastic state machine and (ii) calculating an infinite sum over Σ* by resorting to the intersection of probabilistic automata, as done for rational kernels. We show on a handwritten character recognition task that our new kernel outperforms not only state-of-the-art string kernels and string edit kernels but also the standard edit distance used by a neighborhood-based classifier.

7.
The probabilistic linguistic term set is a powerful tool for expressing and characterizing people's complex cognitive information, and it has developed considerably over the last several years. To use probabilistic linguistic term sets in decision making effectively, information measures such as the distance, similarity, entropy and correlation measures should be defined. However, the inclusion measure, an important kind of information measure, has not yet been defined. This study proposes the inclusion measure for probabilistic linguistic term sets, and formulas to calculate inclusion degrees are put forward. Then, we introduce normalized axiomatic definitions of the distance, similarity and entropy measures of probabilistic linguistic term sets to construct a unified framework of information measures. Based on these definitions, we present the relationships and transformation functions among the distance, similarity, entropy and inclusion measures; further formulas for calculating the distance, similarity, inclusion degree and entropy can be induced from these transformation functions. Finally, we put forward an orthogonal clustering algorithm based on the inclusion measure and use it to classify cities in the Economic Zone of Chengdu Plain, China.

8.
We study the computational power of systems where information is stored in independent strings and each computational step consists of exchanging information between randomly chosen pairs. To this end, we introduce a population genetics model in which the operators of selection and inheritance are effectively computable (in polynomial time on probabilistic Turing machines). We show that such systems are as powerful as the usual models of parallel computation: they can simulate polynomial-space computations in polynomially many steps. We also show that the model has the same power if the recombination rules for strings are very simple (context-sensitive crossing over).

9.
We consider two characterisations of the may- and must-testing preorders for a probabilistic extension of the finite π-calculus: one based on notions of probabilistic weak simulation, and the other on a probabilistic extension of a fragment of the Milner–Parrow–Walker modal logic for the π-calculus. We base our notions of simulation on similar concepts used in previous work on probabilistic CSP. However, unlike the case of CSP (or other non-value-passing calculi), there are several possible definitions of simulation for the probabilistic π-calculus, arising from different ways of scoping the name quantification. We show that in order to capture the testing preorders, one needs to use the "earliest" simulation relation (in analogy with the notion of early (bi)simulation in the non-probabilistic case). The key ideas in both characterisations are the notion of a "characteristic formula" of a probabilistic process and the notion of a "characteristic test" for a formula. As in earlier work on testing equivalence for the π-calculus by Boreale and De Nicola, we extend the language of the π-calculus with a mismatch operator, without which the formulation of a characteristic test would not be possible.

10.
A 1976 theorem of Chaitin can be used to show that there are arbitrarily dense sets of lengths n with a paucity of trivial strings (only a bounded number of strings of length n have trivially low plain Kolmogorov complexity). We use the probabilistic method to give a new proof of this fact. This proof is much simpler than previously published proofs, and it gives a tighter paucity bound.

11.
Grasping is a fundamental skill for robots that perform manipulation tasks. Grasping of unknown objects remains a big challenge, and precision grasping of unknown objects is harder still. Due to imperfect sensor measurements and a lack of prior knowledge of objects, robots have to handle the uncertainty effectively. In previous work (Chen and Wichert 2015), we used a probabilistic framework to tackle precision grasping of model-based objects. In this paper, we extend the probabilistic framework to the problem of precision grasping of unknown objects. We first propose an object model called the probabilistic signed distance function (p-SDF) to represent unknown object surfaces. p-SDF models measurement uncertainty explicitly and allows measurements from multiple sensors to be fused in real time. Based on this surface representation, we propose a model that evaluates the likelihood of grasp success for antipodal grasps, using four heuristics to model the conditions of force closure and perceptual uncertainty. A two-step simulated annealing approach is further proposed to search for and optimize a precision grasp. We use the object representation as a bridge to unify grasp synthesis and grasp execution. Our grasp execution is performed in closed loop, so that robots can actively reduce uncertainty and react to external perturbations during grasping. We perform extensive grasping experiments using challenging real-world objects and demonstrate that our method achieves high robustness and accuracy in grasping unknown objects.
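A toy stand-in for the p-SDF idea (not the paper's model): keep a Gaussian over the signed distance at each voxel and fuse each new sensor reading with the standard product-of-Gaussians update, so that noisier sensors move the estimate less. The class name, grid shape, and noise values are all hypothetical.

```python
import numpy as np

class ProbabilisticSDF:
    """Per-voxel Gaussian over signed distance, fused across sensors.
    Hypothetical minimal sketch: each measurement updates the voxel's
    mean and variance via the Gaussian product (Kalman-style) rule."""
    def __init__(self, shape, prior_var=1e4):
        self.mean = np.zeros(shape)                 # prior: surface unknown
        self.var = np.full(shape, prior_var)        # large prior uncertainty

    def fuse(self, idx, dist, noise_var):
        """Fuse one signed-distance reading with variance noise_var."""
        k = self.var[idx] / (self.var[idx] + noise_var)  # gain in [0, 1)
        self.mean[idx] += k * (dist - self.mean[idx])
        self.var[idx] *= (1.0 - k)

sdf = ProbabilisticSDF((4, 4, 4))
sdf.fuse((1, 2, 3), dist=0.02, noise_var=0.01)  # e.g. a depth-camera reading
sdf.fuse((1, 2, 3), dist=0.04, noise_var=0.02)  # a second, noisier sensor
print(sdf.mean[1, 2, 3], sdf.var[1, 2, 3])
```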

12.
Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, and feature vectors of pictures, photos, and videos. Although polynomial-time algorithms for such problems trivially exist, since the number of substrings is bounded by the square of the string length, straightforward algorithms do not work for huge databases because their computation time is of too high a polynomial order. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings of a fixed length. Comparison of long strings can be handled by inputting all their fixed-length substrings, so that candidates for similar, non-short substrings can be found. We focus on the practical efficiency of algorithms and propose one that runs in time almost linear in the input/output size. We prove that the computation time of a variant is linear in the database size when the length of the short strings is constant, and computational experiments on genome sequences and Web texts show its practical efficiency. Slight modifications adapt the algorithm to edit-distance and mismatch-tolerance computation. An implementation is available at the author's homepage.
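One standard way to realize near-linear behavior for this problem is the pigeonhole blocking trick, sketched below under the abstract's assumption of equal-length strings: a pair within Hamming distance d must agree exactly on at least one of d + 1 blocks, so bucketing by each block yields a small candidate set to verify. This is a generic technique consistent with the abstract, not necessarily the author's exact algorithm.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similar_pairs(strings, d):
    """All unordered pairs within Hamming distance d. Pigeonhole: split
    into d + 1 blocks; a pair within distance d matches on some block."""
    L = len(strings[0])
    k = d + 1
    bounds = [(i * L // k, (i + 1) * L // k) for i in range(k)]
    found = set()
    for lo, hi in bounds:
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[s[lo:hi]].append(idx)
        for ids in buckets.values():            # verify candidates only
            for i, j in combinations(ids, 2):
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return [(strings[i], strings[j]) for i, j in sorted(found)]

print(similar_pairs(["ACGTACGT", "ACGTACGA", "TTTTACGT", "ACGAACGA"], d=1))
```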

13.
G. Tremblay, F. Champagne. Software, 2007, 37(2): 207-230
Musical dictation, used for ear training and training in music writing, is a key practice in basic musical training, and marking dictation exercises for large groups of students can require a lot of work. In this paper, we present a tool, called CADiM, that helps automate the marking of such musical dictations. The edit distance, which computes the similarity between two strings, has been used in areas such as string/text analysis, protein/genome matching in bio-computing, and musical applications such as music retrieval and musicological analysis. CADiM's marking algorithm is based on an edit distance proposed earlier for musical sequences, adapted to reflect the marking heuristic used in a domain expert's specific approach to musical training. Computing an edit distance on musical scores requires an appropriate representation; given our specific context, a symbolic representation is required. We use MusicXML, an XML application for standard Western music notation. Given a Document Type Definition for MusicXML, existing Java tools can generate a MusicXML parser; such a parser, given appropriate input files, generates an intermediate form (a DOM object) on which analyses and transformations are performed to compute the edit distance. The edit distance, in turn, is used to assign a mark and to identify the key errors. CADiM has been applied to a number of test cases and the results compared with those obtained by a domain expert. Overall, the results are promising: only a 3% difference between the domain expert's marks and those produced by CADiM. Copyright © 2006 John Wiley & Sons, Ltd.

14.
A fully probabilistic approach to reconstructing Gaussian graphical models from distance data is presented. The main idea is to extend the central Wishart model used in traditional methods to a likelihood that depends only on pairwise distances and is therefore independent of geometric assumptions about the underlying Euclidean space. This extension has two advantages: the model becomes invariant to potential bias terms in the measurements, and it can be used in situations where the input is a kernel or distance matrix, without requiring direct access to the underlying vectors. The latter aspect opens up a huge new application field for Gaussian graphical models, since network reconstruction is now possible from any Mercer kernel, be it on graphs, strings, probabilities or more complex objects. We combine this likelihood with a suitable prior to enable Bayesian network inference, present an efficient MCMC sampler for the model, and discuss the estimation of module networks. Experiments demonstrate the high quality and usefulness of the inferred networks.

15.
16.
The consensus (string) problem is to find a representative string, called a consensus, for a given set S of strings. In this paper we deal with consensus problems considering both the distance sum and the radius, where the distance sum is the sum of (Hamming) distances from the strings in S to the consensus and the radius is the largest (Hamming) distance from the strings in S to the consensus. Although there have been results considering either the distance sum or the radius, to the best of our knowledge there have been none considering both. We present the first algorithms for two consensus problems on three strings that consider both distance sum and radius: one problem is to find an optimal consensus minimizing both distance sum and radius; the other is to find a bounded consensus such that the distance sum is at most s and the radius is at most r, for given constants s and r. Our algorithms are based on a characterization of the lower bounds of the distance sum and radius, and thus solve the problems efficiently. Both algorithms run in linear time.
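As a point of reference for why the two objectives can conflict, the sketch below computes the columnwise-majority consensus, which minimizes the distance sum under Hamming distance but need not minimize the radius. It is a toy illustration, not the paper's linear-time algorithm.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def majority_consensus(strings):
    """Columnwise majority vote: minimizes the distance sum under Hamming
    distance, though it need not minimize the radius."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

strings = ["ACGT", "AGGT", "ACGA"]
c = majority_consensus(strings)
dist_sum = sum(hamming(c, s) for s in strings)
radius = max(hamming(c, s) for s in strings)
print(c, dist_sum, radius)  # ACGT 2 1
```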

17.
Weak Probabilistic Anonymity
Anonymity means that the identity of the user performing a certain action is kept secret. Protocols for ensuring anonymity often use random mechanisms which can be described probabilistically. In this paper we propose a notion of weak probabilistic anonymity, where "weak" refers to the fact that some amount of probabilistic information may be revealed by the protocol. This information can be used by an observer to infer the likelihood that the action has been performed by a certain user. The aim of this work is to study the degree of anonymity that the protocol can still ensure, despite the leakage of information. We illustrate our ideas using the example of the dining cryptographers with biased coins, considering both nondeterministic and probabilistic users. Correspondingly, we propose two notions of weak anonymity and investigate their respective dependencies on the bias factor of the coins.
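The leakage is easy to see empirically. The simulation below implements the three-party dining cryptographers protocol with coins of a given bias: with fair coins the announcement distribution is the same whichever cryptographer pays, while with biased coins the pattern concentrates around the payer's position. The bias values and trial count are arbitrary choices for illustration.

```python
import random
from collections import Counter

def run_protocol(payer, bias, rng):
    """Three dining cryptographers; coins[i] is shared by i and (i+1) % 3.
    Each announces the XOR of their two coins, flipped if they paid."""
    coins = [rng.random() < bias for _ in range(3)]
    return tuple(coins[i] ^ coins[(i - 1) % 3] ^ (i == payer) for i in range(3))

def leakage(bias, trials=100_000, seed=1):
    """Empirical distribution of announcement patterns for each payer.
    With bias = 0.5 the three rows coincide; bias makes them differ."""
    rng = random.Random(seed)
    return {p: Counter(run_protocol(p, bias, rng) for _ in range(trials))
            for p in range(3)}

for bias in (0.5, 0.8):
    print(bias, {p: dict(c.most_common(2)) for p, c in leakage(bias).items()})
```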

18.
Moritz G. Maass. Algorithmica, 2006, 46(3-4): 469-491
For exact search of a pattern of length m in a database of n strings, the trie data structure allows an optimal lookup time of O(m). If mismatches are allowed between the pattern and the database strings, no such structure of reasonable size is known. Some work can nevertheless be saved using a trie, achieving running times superior to comparing the pattern with every string in the database. We investigate a comparison-based model where matches and mismatches are defined between pairs of characters. When comparing two characters, let q be the probability of an error; between any two strings we bound the number of errors by d, which we consider a function of n. We study the average-case complexity of the number of comparisons for searching in a trie as a function of the parameters q and d. Our analysis yields the asymptotic behavior for memoryless sources with uniform probabilities, and it turns out that there is a jump in the average-case complexity at certain thresholds for q and d. Our results apply to any comparison-based error model, for instance Hamming distance, don't-cares, or geometric character distances.
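A small sketch of the search being analysed, assuming equal-length strings and a Hamming-style error model: walk the trie depth-first, charging one error per mismatching character, and prune any subtree once the budget d is exceeded. This pruning is the source of the savings over comparing the pattern with every string.

```python
def build_trie(strings):
    """Nested-dict trie; the None key marks end of string."""
    root = {}
    for s in strings:
        node = root
        for c in s:
            node = node.setdefault(c, {})
        node[None] = True
    return root

def search(trie, pattern, d):
    """All database strings within Hamming distance d of `pattern`
    (equal lengths assumed). Subtrees are pruned once the error
    budget is exhausted."""
    results = []
    def walk(node, depth, errors, prefix):
        if errors > d:
            return                      # prune: too many mismatches already
        if depth == len(pattern):
            if None in node:
                results.append(prefix)
            return
        for c, child in node.items():
            if c is not None:
                walk(child, depth + 1, errors + (c != pattern[depth]), prefix + c)
    walk(trie, 0, 0, "")
    return results

trie = build_trie(["acgt", "acga", "tttt"])
print(search(trie, "acgt", 1))  # ['acgt', 'acga']
```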

19.
Finding a sequence of edit operations that transforms one string of symbols into another with minimum cost is a well-known problem. The minimum cost, or edit distance, is a widely used measure of the similarity of two strings. An important parameter of this problem is the cost function, which specifies the cost of each insertion, deletion, and substitution. We show that cost functions having the same ratio of the sum of the insertion and deletion costs to the substitution cost yield the same minimum-cost sequences of edit operations. This leads to a partitioning of the universe of cost functions into equivalence classes. We also show the relationship between a particular set of cost functions and the longest common subsequence of the input strings. This work was supported in part by the U.S. Department of Defense and the U.S. Department of Energy.
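A quick numeric check of the equivalence-class claim, under assumed cost triples: (ins, del, sub) = (1, 3, 4) and (2, 2, 4) share the ratio (ins + del)/sub = 1, and for any fixed edit script the two costs differ only by a constant determined by the length gap, so the same scripts are optimal under both and the two distances differ by exactly len(x) - len(y).

```python
import random

def edit_distance(x, y, ins, dele, sub):
    """Classic DP for transforming x into y with the given operation costs."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dele,
                          d[i][j - 1] + ins,
                          d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub))
    return d[m][n]

rng = random.Random(7)
for _ in range(200):
    x = "".join(rng.choice("abc") for _ in range(rng.randrange(1, 9)))
    y = "".join(rng.choice("abc") for _ in range(rng.randrange(1, 9)))
    # Same ratio (1 + 3)/4 = (2 + 2)/4, so every script's cost differs only
    # by a script-independent constant: the minima coincide up to len(x) - len(y).
    assert edit_distance(x, y, 1, 3, 4) == edit_distance(x, y, 2, 2, 4) + len(x) - len(y)
print("equal-ratio cost functions picked the same optimal scripts")
```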

20.
We present techniques for improving performance-driven facial animation, emotion recognition, and facial key-point (landmark) prediction using learned identity-invariant representations. Established approaches to these problems can work well if sufficient examples and labels for a particular identity are available and the factors of variation are highly controlled. However, labeled examples of facial expressions, emotions and key-points for new individuals are difficult and costly to obtain. In this paper we improve the ability of these techniques to generalize to new and unseen individuals by explicitly modeling previously seen variation related to identity and expression. We use a weakly supervised approach in which identity labels are used to learn the factors of variation linked to identity separately from those related to expression. We show how probabilistic modeling of these sources of variation allows one to learn identity-invariant representations for expressions, which can then be used to identity-normalize various procedures for facial expression analysis and animation control. We also show how to extend the widely used active appearance models and constrained local models by replacing the underlying point distribution models, typically constructed using principal component analysis, with identity-expression factorized representations. We present a wide variety of experiments in which we consistently improve performance on emotion recognition, markerless performance-driven facial animation, and facial key-point tracking.


