共查询到20条相似文献,搜索用时 484 毫秒
1.
按照元组描述的实体对其进行组织和查询处理,是一种管理劣质数据的有效方法.考虑到同一个实体的同一属性存在多个描述的值,因此,基于实体的数据库上的连接是支持多个值的相似性连接.与字符串的相似性连接相比较,实体的相似性连接在数据清洗、信息集成、模糊关键字查询、诈骗检测和文本聚集等领域有着更好的应用效果.通过建立双层索引结构,提出了实体数据库上相似性连接算法ES-JOIN.同时,该方法适用于解决集合中字符串模糊匹配的相似性连接问题,而传统的集合相似性连接只针对集合中元素精确匹配的情况.为了加速连接,还提出了过滤措施对算法进行优化,进一步给出了优化算法OPT_ES-JOIN.实验验证了ES-JOIN算法和OPT_ES-JOIN算法具有很好的效率和可扩展性.实验结果表明,过滤措施具有很好的过滤效果. 相似文献
2.
Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, in the paper, we make use of a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. As for each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time. 相似文献
3.
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as
a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing
site for query execution. This typically introduces high communication overhead. Our observation is that semijoin, effective
in reducing communication overhead in distributed database query processing, can be also effective in distributed stream query
processing. The challenge, however, lies in the streaming nature of the tuples, as it requires continuous and incremental
processing of an unbounded sequence of tuples instead of one-time processing of a set of stored tuples. This paper describes
our comprehensive work done to address the challenge. Specifically, we first propose a distributed stream join processing
model that handles the issue of network delays introduced from the shipment of data streams, and allows for efficient batch
processing. Then, based on the model, we propose join algorithms in a multi-way join case: first, one-way join algorithms
for different combinations of join placement and join method and, then, multi-way join algorithms assuming linear join ordering.
Regarding the join method, two distributed join methods are introduced: (1) simple join, in which full tuples are forwarded to the query processing site and (2) semijoin-based join, in which partial tuples are forwarded. A semijoin-based join can be executed with different possible semijoin strategies which incur different
communication overheads. We present a complete set of join algorithms considering all possible semijoin strategies, and propose
an optimization algorithm. The join algorithms are executed continuously in an incremental manner as tuples arrive, and never
ship tuples redundantly. The optimization algorithm constructs an efficient multi-way join plan by using a greedy heuristic
which adds to the plan one stream with the minimum join execution cost in each step. Through extensive experiments, we conduct
comparative studies of the performance among the proposed one-way join algorithms and the efficiency of the generated plan
between the optimization algorithm based on the greedy heuristic and the exhaustive search, respectively. 相似文献
4.
A join of two relations in real databases is usually much smaller than their Cartesian product. This means that most of the
combinations of tuples in the crossproduct of the respective relations do not appear together in the join result. We characterize
these combinations as ranges of attributes that do not appear together. We sketch an algorithm for finding such combinations
and present experimental results from real data sets. We then explore two potential applications of this knowledge in query
processing. In the first application, we model empty joins as materialized views, we show how they can be used for query optimization.
In the second application, we propose a strategy that uses information about empty joins for an improved join selectivity
estimation. 相似文献
5.
Karl Schnaitter Joshua Spiegel Neoklis Polyzotis 《The VLDB Journal The International Journal on Very Large Data Bases》2009,18(2):521-542
A relational ranking query uses a scoring function to limit the results of a conventional query to a small number of the most
relevant answers. The increasing popularity of this query paradigm has led to the introduction of specialized rank join operators
that integrate the selection of top tuples with join processing. These operators access just “enough” of the input in order
to generate just “enough” output and can offer significant speed-ups for query evaluation. The number of input tuples that
an operator accesses is called the input depth of the operator, and this is the driving cost factor in rank join processing. This introduces the important problem of depth estimation, which is crucial for the costing of rank join operators during query compilation and thus for their integration in optimized
physical plans. We introduce an estimation methodology, termed deep, for approximating the input depths of rank join operators in a physical execution plan. At the core of deep lies a general, principled framework that formalizes depth computation in terms of the joint distribution of scores in the
base tables. This framework results in a systematic estimation methodology that takes the characteristics of the data directly
into account and thus enables more accurate estimates. We develop novel estimation algorithms that provide an efficient realization
of the formal deep framework, and describe their integration on top of the statistics module of an existing query optimizer. We validate the
performance of deep with an extensive experimental study on data sets of varying characteristics. The results verify the effectiveness of deep as an estimation method and demonstrate its advantages over previously proposed techniques. 相似文献
6.
7.
Tae-Hyung Kwon Ki Yong Lee Myoung Ho Kim 《Journal of Intelligent Information Systems》2011,37(2):245-265
We address the problem of load shedding for continuous multi-way join queries over multiple data streams. When the arrival rates of tuples from data streams exceed the system capacity, a load shedding algorithm drops some subset of input tuples to avoid system overloads. To decide which tuples to drop among the input tuples, most existing load shedding algorithms determine the priority of each input tuple based on the frequency or some historical statistics of its join attribute value, and then drop tuples with the lowest priority. However, those value-based algorithms cannot determine the priorities of tuples properly in environments where join attribute values are unique and each join attribute value occurs at most once in each data stream. In this paper, we propose a load shedding algorithm specifically designed for such environments. The proposed load shedding algorithm determines the priority of each tuple based on the order of streams in which its join attribute value appears, rather than its join attribute value itself. Consequently, the priorities of tuples can be determined effectively in environments where join attribute values are unique and do not repeat. The experimental results show that the proposed algorithm outperforms the existing algorithms in such environments in terms of effectiveness and efficiency. 相似文献
8.
Sensor networks are widely used in many applications to collaboratively collect information from the physical environment. In these applications,the exploration of the relationship and linkage of sensing data within multiple regions can be naturally expressed by joining tuples in these regions. However,the highly distributed and resource-constraint nature of the network makes join a challenging query. In this paper,we address the problem of processing join query among different regions progressively and energy-efficiently in sensor networks. The proposed algorithm PEJA(Progressive Energy-efficient Join Algorithm) adopts an event-driven strategy to output the joining results as soon as possible,and alleviates the storage shortage problem in the in-network nodes. It also installs filters in the joining regions to prune unmatchable tuples in the early processing phase,saving lots of unnecessary transmissions. Extensive experiments on both synthetic and real world data sets indicate that the PEJA scheme outperforms other join algorithms,and it is effective in reducing the number of transmissions and the delay of query results during the join processing. 相似文献
9.
Semantic approximation of data stream joins 总被引:1,自引:0,他引:1
Das A. Gehrke J. Riedewald M. 《Knowledge and Data Engineering, IEEE Transactions on》2005,17(1):44-59
We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We make two main contributions. First, we define the problem space by discussing architectural models for data stream join processing and surveying suitable measures for the quality of an approximation of a set-valued query result. Second, we examine in detail a large part of this problem space. More precisely, we consider the number of generated result tuples as the quality measure and we propose optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data, we show the efficacy of our solutions. 相似文献
10.
如何快速发现数据集中重复或相似的记录是大数据处理技术中的一个基本问题。相似连接是一种有效的相似数据查找方法,且基于MapReduce的相似连接算法因对大数据集的处理能力强而得到广泛关注。通过分析当前相似连接算法进行自连接时存在的自连接冗余、读取原字符串复杂等问题,在Massjoin算法的基础上提出了一种改进的基于MapReduce的自连接算法。改进算法在过滤阶段增加了消除自身冗余的过滤条件,在验证阶段又采用了生成正反候选对和组合id等去冗余技术,并且读取原始字符串内容时只需读取数据集一次。实验数据显示,改进算法无论在过滤阶段还是在验证阶段都减少了算法的CPU时耗,结果表明所提改进策略是有效的。 相似文献
11.
按照元组描述的实体对其进行组织和查询处理是一种管理劣质数据的有效方法。考虑到同一个实体的同一属性存在多个描述值,因此基于实体的数据库上的连接是支持多个值的相似性连接。由于多表连接操作的连接顺序对连接性能有着重要的影响,研究了实体数据库上多表连接顺序选择方法,采用基于实体的马尔可夫链蒙特卡洛(Markov chain Monte Carol,MCMC)方法估计出实体数据库的相似性连接操作的结果大小,并以连接结果大小和有无索引作为主要代价,提出了基于实体的多连接顺序优化策略。进一步,通过实验证明了估计连接结果大小的算法在大规模数据上有着显著的优势。 相似文献
12.
13.
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically
different tuples, that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived
tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index, that maps any tuple to a set of indexing keys and
assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently
identified by simply retrieving those tuples that appear in the same buckets associated to the query tuple itself, without
completely scanning the original database. Two alternative schemes for computing indexing keys are discussed and compared.
An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach. 相似文献
14.
《Journal of Computer and System Sciences》2007,73(4):648-668
The full disjunction is a variation of the join operator that maximally combines tuples from connected relations, while preserving all information in the relations. The full disjunction can be seen as a natural extension of the binary outerjoin operator to an arbitrary number of relations and is a useful operator for information integration. This paper presents the algorithm IncrementalFD for computing the full disjunction of a set of relations. IncrementalFD improves upon previous algorithms for computing the full disjunction in four ways. First, it has a lower total runtime when computing the full result and a lower runtime when computing only k tuples of the result, for any constant k. Second, for a natural class of ranking functions, IncrementalFD can be adapted to return tuples in ranking order. Third, a variation of IncrementalFD can be used to return approximate full disjunctions (which contain maximal approximately join consistent tuples). Fourth, IncrementalFD can be adapted to have a block-based execution, instead of a tuple-based execution. 相似文献
15.
The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of the similarity join are well-known, the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important, third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, postprocessing of sampling-based data mining, etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies. 相似文献
16.
The optimization capabilities of RDBMSs make them attractive for executing data transformations. However, despite the fact that many useful data transformations can be expressed as relational queries, an important class of data transformations that produce several output tuples for a single input tuple cannot be expressed in that way.
To overcome this limitation, we propose to extend Relational Algebra with a new operator named data mapper. In this paper, we formalize the data mapper operator and investigate some of its properties. We then propose a set of algebraic rewriting rules that enable the logical optimization of expressions with mappers and prove their correctness. Finally, we experimentally study the proposed optimizations and identify the key factors that influence the optimization gains. 相似文献
17.
Aggregate keyword search on large relational databases 总被引:2,自引:1,他引:1
Keyword search has been recently extended to relational databases to retrieve information from text-rich attributes. However,
all the existing methods focus on finding individual tuples matching a set of query keywords from one table or the join of
multiple tables. In this paper, we motivate a novel problem of aggregate keyword search: finding minimal group-bys covering
a set of query keywords well, which is useful in many applications. We develop two interesting approaches to tackle the problem.
We further extend our methods to allow partial matches and matches using a keyword ontology. An extensive empirical evaluation
using both real data sets and synthetic data sets is reported to verify the effectiveness of aggregate keyword search and
the efficiency of our methods. 相似文献
18.
Bhowmick S.S. Madria S.K. Wee Keong Ng 《Knowledge and Data Engineering, IEEE Transactions on》2003,15(2):423-441
In this paper, we present a mechanism for detecting and representing changes, given the old and new versions of a set of interlinked Web documents, retrieved in response to a user's query. In particular, we show how to detect and represent Web deltas, i.e., changes in the Web documents that are relevant to a user's query in the context of our Web warehousing system called WHOWEDA (Warehouse of Web Data). In WHOWEDA, Web information is materialized views stored in Web tables in the form of Web tuples. These Web tuples, represented as directed graphs, can be manipulated using a set of Web algebraic operators. In this paper, we present a mechanism to detect relevant Web deltas using Web algebraic operators such as the Web join and the outer Web join. Web join is used to detect identical documents residing in two Web tables, whereas, outer Web join, a derivative of Web join, is used to identify dangling Web tuples. We show how to represent these changes using delta Web tables. We develop formal algorithms for the generation of delta Web tables identifying Web documents which have been added, deleted, or modified since the last query. 相似文献
19.
H.H. Shahri S.H. Shahri 《Intelligent Systems, IEEE》2006,21(5):63-71
Data cleaning is an inevitable problem when integrating data from distributed operational databases, because no unified set of standards spans all the distributed sources. One of the most challenging phases of data cleaning is removing fuzzy duplicate records. Approximate or fuzzy duplicates pertain to two or more tuples that describe the same real-world entity using different syntaxes. In other words, they have the same semantics but different syntaxes. Eliminating fuzzy duplicates is applicable in any database but is critical in data-integration and analytical-processing domains, which involve data warehouses, data mining applications, and decision support systems. Earlier approaches, which required hard coding rules based on a schema, were time consuming and tedious, and you couldn't later adapt the rules. We propose a novel duplicate-elimination framework which exploits fuzzy inference and includes unique machine learning capabilities to let users clean their data flexibly and effortlessly without requiring any coding 相似文献
20.
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems. 相似文献