Similar literature
Found 20 similar documents (search time: 15 ms)
1.
Top-k similarity joins have been extensively studied and used in a wide spectrum of applications such as information retrieval, decision making, spatial data analysis and data mining. Given two sets of objects $\mathcal U$ and $\mathcal V$, a top-k similarity join returns the k pairs of most similar objects from $\mathcal U \times \mathcal V$. In the conventional model of top-k similarity join processing, an object is usually regarded as a point in a multi-dimensional space and the similarity is measured by a simple distance metric such as Euclidean distance. However, in many applications an object may be described by multiple values (instances), and the conventional model is not applicable since it does not address the distributions of object instances. In this paper, we study top-k similarity joins over multi-valued objects. We apply two types of quantile-based distance measures, $\phi$-quantile distance and $\phi$-quantile group-base distance, to explore the relative instance distribution among the multiple instances of objects. Efficient and effective techniques to process top-k similarity joins over multi-valued objects are developed following a filtering-refinement framework. Novel distance-, statistic- and weight-based pruning techniques are proposed. Comprehensive experiments on both real and synthetic datasets demonstrate the efficiency and effectiveness of our techniques.

2.
This paper presents the design and implementation of the Neuma platform, a digital library devoted to the preservation and dissemination of symbolic music content (scores). Neuma is open to musicologists, musicians, and music publishers. It consists of a repository dedicated to the storage of large collections of digital scores, where users/applications can upload their documents. It also proposes services to publish, annotate, query, transform, and analyze scores. The long-term goal of the project is to enable an open and collaborative space where musician communities will be able to share music in symbolic notation. The project is organized around the French IRPMF institute (BnF–CNRS) which chooses and produces collections (or corpora), collects and organizes user requirements, and validates new publications. We describe the architecture of Neuma and develop some of its salient features: score modeling annotations, search and transformation language, collaborative and community tools, and digital rights preservation. We illustrate these features with two collections published by Neuma, and discuss the impact of such on-line score collections from a musicological perspective.

3.
As powerful tools, machine learning and data mining techniques have been widely applied in various areas. However, in many real-world applications, besides establishing accurate black box predictors, we are also interested in white box mechanisms, such as discovering predictive patterns in data that enhance our understanding of underlying physical, biological and other natural processes. For these purposes, sparse representation and its variations have been a major focus. More recently, structural sparsity has attracted increasing attention. In previous research, structural sparsity was often achieved by imposing convex but non-smooth norms such as the ${\ell _{2}/\ell _{1}}$ and group ${\ell _{2}/\ell _{1}}$ norms. In this paper, we present the explicit ${\ell _2/\ell _0}$ and group ${\ell _2/\ell _0}$ norms to directly approach structural sparsity. To tackle the problem of intractable ${\ell _2/\ell _0}$ optimizations, we develop a general Lipschitz auxiliary function that leads to simple iterative algorithms. In each iteration, an optimal solution is achieved for the induced subproblem and a guarantee of convergence is provided. Furthermore, the local convergence rate is also theoretically bounded. We test our optimization techniques on the multitask feature learning problem. Experimental results suggest that our approaches outperform other approaches on both synthetic and real-world data sets.

4.
To study the data dependencies over heterogeneous data in dataspaces, we define a general dependency form, namely comparable dependencies (CDs), which specifies constraints on comparable attributes. It covers the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrate, comparable dependencies are useful in real practice of dataspaces, such as semantic query optimization. Due to heterogeneous data in dataspaces, the first question, known as the validation problem, is to tell whether a dependency (almost) holds in a data instance. Unfortunately, as we prove, the validation problem with a certain error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximation computation, such as greedy and randomized approaches with an approximation bound on the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.
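
As a rough illustration of the validation problem in its simplest special case, a plain functional dependency, the following sketch lists the pairs of tuples that agree on the left-hand side but disagree on the right-hand side. The function name `fd_violations`, the dict-based tuple representation and the sample data are assumptions made for illustration only; comparable dependencies generalize this equality test to comparisons over comparable attributes, which the sketch does not attempt.

```python
from itertools import combinations


def fd_violations(rows, lhs, rhs):
    """Return index pairs (i, j) of rows that violate the FD lhs -> rhs.

    A pair violates the dependency when both rows agree on every
    attribute in lhs but differ on at least one attribute in rhs.
    """
    violations = []
    for (i, r1), (j, r2) in combinations(enumerate(rows), 2):
        same_lhs = all(r1[a] == r2[a] for a in lhs)
        same_rhs = all(r1[a] == r2[a] for a in rhs)
        if same_lhs and not same_rhs:
            violations.append((i, j))
    return violations


# Hypothetical data: the first two rows share a zip code but spell the city differently.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]
pairs = fd_violations(rows, lhs=["zip"], rhs=["city"])
```

An empty result means the dependency holds exactly; a small non-empty result corresponds to the "almost holds" case mentioned in the abstract.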

5.
The class ${\mathcal{SLUR}}$ (Single Lookahead Unit Resolution) was introduced in Schlipf et al. (Inf Process Lett 54:133–137, 1995) as an umbrella class for efficient (poly-time) SAT solving, with linear-time SAT decision, while the recognition problem was not considered. Čepek et al. (2012) and Balyo et al. (2012) extended this class in various ways to hierarchies covering all of CNF (all clause-sets). We introduce a hierarchy ${\mathcal{SLUR}}_k$ which we argue is the natural “limit” of such approaches. The second source for our investigations is the class ${\mathcal{UC}}$ of unit-refutation complete clause-sets, introduced in del Val (1994) as a target class for knowledge compilation. Via the theory of “hardness” of clause-sets as developed in Kullmann (1999), Kullmann (Ann Math Artif Intell 40(3–4):303–352, 2004) and Ansótegui et al. (2008) we obtain a natural generalisation ${\mathcal{UC}}_k$, containing those clause-sets which are “unit-refutation complete of level k”, which is the same as having hardness at most k. Utilising the strong connections to (tree-)resolution complexity and (nested) input resolution, we develop basic methods for the determination of hardness (the level k in ${\mathcal{UC}}_k$). A fundamental insight now is that ${\mathcal{SLUR}}_k = {\mathcal{UC}}_k$ holds for all k. We can thus exploit both streams of intuitions and methods for the investigations of these hierarchies. As an application we can easily show that the hierarchies from Čepek et al. (2012) and Balyo et al. (2012) are strongly subsumed by ${\mathcal{SLUR}}_k$. Finally we consider the problem of “irredundant” clause-sets in ${\mathcal{UC}}_k$. For 2-CNF we show that strong minimisations are possible in polynomial time, while already for (very special) Horn clause-sets minimisation is NP-complete. We conclude with an extensive discussion of open problems and future directions.
We envisage the concepts investigated here to be the starting point for a theory of good SAT translations, which brings together the good SAT-solving aspects of ${\mathcal{SLUR}}$ with the knowledge-representation aspects of ${\mathcal{UC}}$, and expands this combination via notions of “hardness”.

6.
Query result clustering has attracted considerable attention as a means of providing users with a concise overview of results. However, little research effort has been devoted to organizing the query results for entities which refer to real-world concepts, e.g., people, products, and locations. Entity-level result clustering is more challenging because diverse similarity notions between entities need to be supported in heterogeneous domains, e.g., image resolution is an important feature for cameras, but not for fruits. To address this challenge, we propose a hybrid relationship clustering algorithm, called Hydra, using co-occurrence and numeric features. Algorithm Hydra captures diverse user perceptions from co-occurrence and disambiguates different senses using feature-based similarity. In addition, we extend Hydra into $\mathsf{Hydra}_\mathsf{gData}$ with different sources, i.e., entity types and crowdsourcing. Experimental results show that the proposed algorithms are effective and efficient on real-life and synthetic datasets.

7.
Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase curse of dimensionality. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take hubness into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of $k$-nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure $simhub_s$, and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard $simcos_s$ measure. The proposed similarity changes the underlying $k$NN graph in such a way that it reduces the overall frequency of label mismatches in $k$-neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially ‘free’.
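
To make the notion of a shared neighbor similarity concrete, here is a minimal sketch of the standard measure, assuming the common definition in which the similarity of two points is the size of the overlap between their $s$-nearest-neighbor sets divided by $s$. The helper names `knn_sets` and `simcos`, the brute-force neighbor search and the toy points are illustrative assumptions; the hubness-aware $simhub_s$ is not reproduced because the abstract does not define it.

```python
import math


def knn_sets(points, s):
    """Map each point index to the set of indices of its s nearest neighbors."""
    neighbors = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Brute-force search: sort all other points by Euclidean distance.
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors[i] = set(others[:s])
    return neighbors


def simcos(neighbors, i, j, s):
    """Shared-neighbor similarity: fraction of the s neighbors that i and j share."""
    return len(neighbors[i] & neighbors[j]) / s


# Two well-separated toy clusters of three points each (hypothetical data).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
nn = knn_sets(points, s=2)
within = simcos(nn, 0, 1, s=2)   # points from the same cluster
across = simcos(nn, 0, 3, s=2)   # points from different clusters
```

Points in the same cluster share part of their neighbor sets, while points from different clusters share none, which is the contrast a secondary distance is meant to restore.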

8.
We address the question of the weakest failure detector to circumvent the impossibility of $(2n-2)$-renaming in a system of up to $n$ participating processes. We derive that in a restricted class of eventual failure detectors there does not exist a single weakest oracle, but a weakest family of oracles $\zeta _n$: every two oracles in $\zeta _n$ are incomparable, and every oracle that allows for solving renaming provides at least as much information about failures as one of the oracles in $\zeta _n$. As a by-product, we obtain further evidence that renaming is strictly easier to solve than set agreement.

9.
In this paper, we focus on the concept classes \({\mathcal {C}}_{{\mathcal{N}}}\) induced by Bayesian networks. We study the relationship between two dimension values induced by these concept classes: the VC-dimension of the concept class \({\mathcal {C}}_{{\mathcal {N}}}\), denoted \(VCdim({\mathcal {N}})\), and the smallest dimension of a Euclidean space into which \({\mathcal {C}}_{{\mathcal {N}}}\) can be embedded, denoted \(Edim({\mathcal {N}})\). As a main result, we show that the two dimension values are equal for Bayesian networks with n ≤ 4 variables; we call this common value the VE-dimension of those Bayesian networks.

10.
Despite a large body of work on XPath query processing in relational environments, the systematic study of queries containing not-predicates has received little attention in the literature. In particular, the XML support of several industrial-strength commercial RDBMSs fails to evaluate such queries efficiently. In this paper, we present an efficient and novel strategy to evaluate not-twig queries in a tree-unaware relational environment. not-twig queries are XPath queries with ancestor–descendant and parent–child axes that contain one or more not-predicates. We propose a novel Dewey-based encoding scheme called Andes (ANcestor Dewey-based Encoding Scheme), which enables us to efficiently filter out elements satisfying a not-predicate by comparing their ancestor group identifiers. In this approach, a set of elements under the same common ancestor at a specific level in the XML tree is assigned the same ancestor group identifier. Based on this scheme, we propose a novel SQL translation algorithm for not-twig query evaluation. Our experiments confirm that the proposed approach, built on top of an off-the-shelf commercial RDBMS, significantly outperforms state-of-the-art relational and native approaches. We also explore the query plans selected by a commercial relational optimizer to evaluate our translated queries under different input cardinalities. This exploration further validates the performance benefits of Andes.

11.
The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint “UV-partitions”. Each UV-partition $P$ is associated with a set $S$ of objects, such that any point $q$ located in $P$ has the set $S$ as its nearest neighbor with nonzero probabilities. The UV-diagram enables queries that return objects with nonzero chances of being the nearest neighbor (NN) of a given point $q$. It supports “continuous nearest-neighbor search”, which refreshes the set of NN objects of $q$ as the position of $q$ changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of UV-cells. A UV-cell of an object $o$ is the extent $e$ for which $o$ can be the nearest neighbor of any point $q \in e$. We study how to speed up the derivation of UV-cells by considering its nearby objects. We also use the UV-cells to design the UV-index, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.
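
The idea of an object having a nonzero chance of being the nearest neighbor can be sketched with simple distance bounds. The example below assumes, purely for illustration, that each uncertain object is a circular region given by a center and a radius; the function name `possible_nearest_neighbors` and the sample objects are hypothetical. It is a brute-force per-query filter, not the UV-diagram or UV-index construction described in the abstract.

```python
import math


def possible_nearest_neighbors(q, objects):
    """Return the names of objects that could be the nearest neighbor of q.

    objects maps a name to (center, radius), a circular uncertainty region.
    An object is kept when its smallest possible distance to q does not
    exceed the smallest of all objects' largest possible distances.
    """
    def min_dist(center, radius):
        return max(0.0, math.dist(q, center) - radius)

    def max_dist(center, radius):
        return math.dist(q, center) + radius

    # The true nearest neighbor can never be farther away than this bound.
    bound = min(max_dist(c, r) for c, r in objects.values())
    return sorted(name for name, (c, r) in objects.items()
                  if min_dist(c, r) <= bound)


# Hypothetical uncertain objects on a line.
objects = {
    "a": ((1.0, 0.0), 0.5),
    "b": ((2.0, 0.0), 1.0),
    "c": ((10.0, 0.0), 1.0),
}
near_origin = possible_nearest_neighbors((0.0, 0.0), objects)
near_c = possible_nearest_neighbors((10.0, 0.0), objects)
```

Running this filter for every query point is what a precomputed partition of the space avoids: all points in the same UV-partition share the same candidate set.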

12.
The query complexity of estimating the mean of some [0, 1] variables is understood. Inspired by some work by Carterette et al. on evaluating retrieval systems, and by Moffat and Zobel's new proposal for such evaluation, we examine the query complexity of weighted average calculation. In general, determining an answer within accuracy ${\varepsilon}$, with high probability, requires ${\Omega(\varepsilon^{-2})}$ queries, as the mean is a special case. There is a matching upper bound for the weighted mean. If the weights are a normalized prefix of a divergent series, the same result holds. However, if the weights follow a geometric sequence, a sample of size ${\Omega(\log (1/\varepsilon))}$ suffices. Our principal contribution is the investigation of power-law sequences of weights. We show that if the $i$th largest weight is proportional to $i^{-p}$, for $p > 1$, then the query complexity is in ${\Omega(\varepsilon^{2/(1-2p)})}$.
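
For the unweighted mean, the $\varepsilon^{-2}$ dependence can be seen from Hoeffding's inequality: sampling $n \ge \ln(2/\delta) / (2\varepsilon^2)$ values in [0, 1] uniformly with replacement gives an estimate within $\varepsilon$ of the true mean with probability at least $1 - \delta$. The sketch below only illustrates this sufficient sample size; the failure probability `delta` and both function names are assumptions introduced here, and the weighted cases studied in the abstract are not covered.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    """Number of uniform samples that suffices for error eps with probability 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_mean(values, eps, delta, rng):
    """Estimate the mean of values in [0, 1] by querying random entries."""
    n = hoeffding_sample_size(eps, delta)
    # Each rng.choice call models one query to a single variable.
    return sum(rng.choice(values) for _ in range(n)) / n


n = hoeffding_sample_size(eps=0.1, delta=0.05)
estimate = estimate_mean([0.5] * 10, eps=0.1, delta=0.05, rng=random.Random(0))
```

Halving `eps` multiplies the sample size by about four, independently of how many variables there are.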

13.
We introduce two new natural decision problems, denoted as ∃ RATIONAL NASH and ∃ IRRATIONAL NASH, pertinent to the rationality and irrationality, respectively, of Nash equilibria for (finite) strategic games. These problems ask, given a strategic game, whether or not it admits (i) a rational Nash equilibrium where all probabilities are rational numbers, and (ii) an irrational Nash equilibrium where at least one probability is irrational, respectively. We are interested here in the complexities of ∃ RATIONAL NASH and ∃ IRRATIONAL NASH. Towards this end, we study two other decision problems, denoted as NASH-EQUIVALENCE and NASH-REDUCTION, pertinent to some mutual properties of the sets of Nash equilibria of two given strategic games with the same number of players. The problem NASH-EQUIVALENCE asks whether or not the two sets of Nash equilibria coincide; we identify a restriction of its complementary problem that witnesses ∃ RATIONAL NASH. The problem NASH-REDUCTION asks whether or not there is a so-called Nash reduction: a suitable map between corresponding strategy sets of players that yields a Nash equilibrium of the former game from a Nash equilibrium of the latter game; we identify a restriction of NASH-REDUCTION that witnesses ∃ IRRATIONAL NASH. As our main result, we provide two distinct reductions to simultaneously show that (i) NASH-EQUIVALENCE is co-$\mathcal{NP}$-hard and ∃ RATIONAL NASH is $\mathcal{NP}$-hard, and (ii) NASH-REDUCTION and ∃ IRRATIONAL NASH are both $\mathcal{NP}$-hard, respectively. The reductions significantly extend techniques previously employed by Conitzer and Sandholm (Proceedings of the 18th Joint Conference on Artificial Intelligence, pp. 765–771, 2003; Games Econ. Behav. 63(2), 621–641, 2008).

14.
We strengthen a previously known connection between the size complexity of two-way finite automata (2FAs) and the space complexity of Turing machines (TMs). Specifically, we prove that
  • every s-state 2NFA has a poly(s)-state 2DFA that agrees with it on all inputs of length ≤ s if and only if NL ⊆ L/poly, and
  • every s-state 2NFA has a poly(s)-state 2DFA that agrees with it on all inputs of length ≤ 2^s if and only if NLL ⊆ LL/polylog.
Here, 2DFAs and 2NFAs are the deterministic and nondeterministic 2FAs, NL and L/poly are the standard classes of languages recognizable in logarithmic space by nondeterministic TMs and by deterministic TMs with access to polynomially long advice, and NLL and LL/polylog are the corresponding complexity classes for space O(log log n) and advice length poly(log n). Our arguments strengthen and extend an old theorem by Berman and Lingas and can be used to obtain variants of the above statements for other modes of computation or other combinations of bounds for the input length, the space usage, and the length of advice.

15.
In multi-task learning, there are roughly two approaches to discovering representations. The first is to discover task-relevant representations, i.e., those that compactly represent solutions to particular tasks. The second is to discover domain-relevant representations, i.e., those that compactly represent knowledge that remains invariant across many tasks. In this article, we propose a new approach to multi-task learning that captures domain-relevant knowledge by learning potential-based shaping functions, which augment a task’s reward function with artificial rewards. We address two key issues that arise when deriving potential functions. The first is what kind of target function the potential function should approximate; we propose three such targets and show empirically that which one is best depends critically on the domain and learning parameters. The second issue is the representation for the potential function. This article introduces the notion of $k$-relevance, the expected relevance of a representation on a sample sequence of $k$ tasks, and argues that this is a unifying definition of relevance of which both task and domain relevance are special cases. We prove formally that, under certain assumptions, $k$-relevance converges monotonically to a fixed point as $k$ increases, and use this property to derive Feature Selection Through Extrapolation of k-relevance (FS-TEK), a novel feature-selection algorithm. We demonstrate empirically the benefit of FS-TEK on artificial domains.
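
Potential-based shaping, in its standard form, adds the term $\gamma \Phi(s') - \Phi(s)$ to the reward for a transition from state $s$ to state $s'$, where $\Phi$ is the potential function and $\gamma$ the discount factor. The sketch below shows only this augmentation step. The toy potential `phi` (negative distance to a goal at position 10) and the integer states are hypothetical; how the article learns its potential functions is not shown.

```python
def shaped_reward(reward, phi, state, next_state, gamma=0.99):
    """Augment a task reward with the potential-based shaping term."""
    return reward + gamma * phi(next_state) - phi(state)


def phi(state):
    """Hypothetical potential: states closer to the goal at 10 have higher potential."""
    return -abs(10 - state)


# With gamma = 1, a step toward the goal earns a bonus and a step away a penalty.
toward = shaped_reward(0.0, phi, state=3, next_state=4, gamma=1.0)
away = shaped_reward(0.0, phi, state=4, next_state=3, gamma=1.0)
```

Because the bonus is a difference of potentials, the artificial rewards along any cycle cancel when `gamma` is 1, which is why this form of shaping does not change which policies are optimal.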

16.
In a sampling problem, we are given an input \(x \in \{0,1\}^{n}\), and asked to sample approximately from a probability distribution \(\mathcal{D}_{x}\) over \(\operatorname{poly}(n)\)-bit strings. In a search problem, we are given an input \(x \in \{0,1\}^{n}\), and asked to find a member of a nonempty set \(A_{x}\) with high probability. (An example is finding a Nash equilibrium.) In this paper, we use tools from Kolmogorov complexity to show that sampling and search problems are “essentially equivalent.” More precisely, for any sampling problem \(S\), there exists a search problem \(R_{S}\) such that, if \(\mathcal{C}\) is any “reasonable” complexity class, then \(R_{S}\) is in the search version of \(\mathcal{C}\) if and only if \(S\) is in the sampling version. What makes this nontrivial is that the same \(R_{S}\) works for every \(\mathcal{C}\). As an application, we prove the surprising result that SampP = SampBQP if and only if FBPP = FBQP. In other words, classical computers can efficiently sample the output distribution of every quantum circuit, if and only if they can efficiently solve every search problem that quantum computers can solve.

17.
In this paper we derive the closed loop form of the Expected Optimal Feedback rule, sometimes called passive learning stochastic control, with time varying parameters. As such this paper extends the work of Kendrick (Stochastic control for economic models, 1981; Stochastic control for economic models, 2002, Chap. 6) where parameters are assumed to vary randomly around a known constant mean. Furthermore, we show that the cautionary myopic rule in the Beck and Wieland (J Econ Dyn Control 26:1359–1377, 2002) model, a test bed for comparing various stochastic optimization approaches, can be cast into this framework and can be treated as a special case of this solution.

18.
In this paper, we consider a popular model for collaborative filtering in recommender systems. In particular, we consider both the clustering model, where only users (or items) are clustered, and the co-clustering model, where both users and items are clustered, and further, we assume that some users rate many items (information-rich users) and some users rate only a few items (information-sparse users). When users (or items) are clustered, our algorithm can recover the rating matrix with \(\omega (MK \log M)\) noisy entries while \(MK\) entries are necessary, where \(K\) is the number of clusters and \(M\) is the number of items. In the case of co-clustering, we prove that \(K^2\) entries are necessary for recovering the rating matrix, and our algorithm achieves this lower bound within a logarithmic factor when \(K\) is sufficiently large. Extensive simulations on Netflix and MovieLens data show that our algorithm outperforms the alternating minimization and the popularity-among-friends algorithm. The performance difference increases even more when noise is added to the datasets.

19.
In this paper, we introduce a new problem termed query reverse engineering (QRE). Given a database \(D\) and a result table \(T\) (the output of some known or unknown query \(Q\) on \(D\)), the goal of QRE is to reverse-engineer a query \(Q'\) such that the output of query \(Q'\) on database \(D\) (denoted by \(Q'(D)\)) is equal to \(T\) (i.e., \(Q(D)\)). The QRE problem has useful applications in database usability, data analysis, and data security. In this work, we propose a data-driven approach, TALOS for Tree-based classifier with At Least One Semantics, that is based on a novel dynamic data classification formulation and extend the approach to efficiently support the three key dimensions of the QRE problem: whether the input query is known/unknown, supporting different query fragments, and supporting multiple database versions.
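
The acceptance condition of QRE, that a candidate query reproduces the given result table, can be sketched with Python's built-in `sqlite3` module. The table `emp`, its rows, the candidate queries and the helper name `reproduces` are all hypothetical; the sketch only checks a candidate against \(T\) and does not implement how TALOS generates candidates.

```python
import sqlite3
from collections import Counter


def reproduces(conn, candidate_sql, target_rows):
    """Check whether the candidate query's output equals the target table.

    Rows are compared as multisets, so row order is ignored but
    duplicate rows must match in number.
    """
    rows = conn.execute(candidate_sql).fetchall()
    return Counter(rows) == Counter(target_rows)


# Hypothetical database D with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("ann", "db", 120), ("bob", "db", 90), ("cy", "ml", 130)],
)

# Result table T produced by some unknown query Q.
target = [("ann",), ("cy",)]

good = reproduces(conn, "SELECT name FROM emp WHERE salary > 100", target)
bad = reproduces(conn, "SELECT name FROM emp WHERE dept = 'db'", target)
```

Several different queries may pass this check on the same database, which is one reason the problem asks for some \(Q'\) with \(Q'(D) = T\) rather than for the original \(Q\).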

20.
In this paper we study parallel algorithms for the Mesh-of-Processors architecture to solve visibility and related separability problems for sets of simple polygons in the plane. In particular, we present the following algorithms:
  • An \(O(\sqrt{N})\) time algorithm for computing, on a Mesh-of-Processors of size N, the visibility polygon from a point located in an N-vertex polygon, possibly with holes.
  • \(O(\sqrt{N})\) time algorithms for computing, on a Mesh-of-Processors of size N, the set of all points on the boundary of an N-vertex polygon P which are visible in a given direction d, as well as the visibility hull of P for a given direction d.
  • An \(O(\sqrt{N})\) time algorithm for detecting, on a Mesh-of-Processors of size 2N, whether two N-vertex polygons are separable in a given direction, and an \(O(\sqrt{MN})\) time algorithm for detecting, on a Mesh-of-Processors of size MN, whether M N-vertex polygons are sequentially separable in a given direction.
All proposed algorithms are asymptotically optimal (for the Mesh-of-Processors) with respect to time and number of processors.
