20 similar documents found (search time: 93 ms)
1.
Distributed Streams Algorithms for Sliding Windows (total citations: 1; self-citations: 0; by others: 1)
Massive data sets often arise as physically distributed, parallel data
streams, and it is important to estimate various aggregates and statistics on the
union of these streams. This paper presents algorithms for estimating aggregate
functions over a “sliding window” of the N most recent data items in one or more
streams. Our results include:
1. For a single stream, we present the first ε-approximation scheme for the number
of 1's in a sliding window that is optimal in both worst-case time and space.
We also present the first ε-approximation scheme for the sum of integers in
[0..R] in a sliding window that is optimal in both worst-case time and space
(assuming R is at most polynomial in N). Both algorithms are deterministic
and use only logarithmically many memory words.
2. In contrast, we show that any deterministic algorithm that estimates, to within
a small constant relative error, the number of 1’s (or the sum of integers) in a
sliding window on the union of distributed streams requires Ω(N) space.
3. We present the first (randomized) (ε, δ)-approximation scheme for the number
of 1’s in a sliding window on the union of distributed streams that uses only
logarithmic memory words. We also present the first (ε, δ)-approximation
scheme for the number of distinct values in a sliding window on distributed
streams that uses only logarithmic memory words.
Our results are obtained using a novel family of synopsis data structures called
waves.
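The wave synopses themselves are not described in the abstract; a related classic structure with the same logarithmic-memory, ε-relative-error behavior for counting 1's in a sliding window is the exponential histogram of Datar et al. The sketch below is a simplified version of that structure, not the paper's waves:

```python
class ExpHistogram:
    """Approximate count of 1's among the last `window` stream items.

    A simplified exponential histogram in the spirit of Datar et al.;
    the paper's own "wave" synopses are not reproduced here. Relative
    error is roughly eps, using O((1/eps) log window) buckets.
    """

    def __init__(self, window, eps):
        self.window = window
        self.k = max(1, int(1 / eps))   # at most k + 1 buckets of each size
        self.time = 0
        self.buckets = []               # [newest_timestamp, size], newest first

    def add(self, bit):
        self.time += 1
        # drop buckets whose newest 1 has left the window
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        size = 1
        while True:
            run = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(run) <= self.k + 1:
                break
            newer, older = run[-2], run[-1]     # two oldest buckets of this size
            self.buckets[newer][1] = 2 * size   # merged bucket keeps the newer timestamp
            del self.buckets[older]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(b[1] for b in self.buckets)
        return total - self.buckets[-1][1] // 2  # half the oldest bucket may be stale
```

Only the oldest bucket can straddle the window boundary, so subtracting half of it bounds the relative error by roughly 1/k.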
2.
We investigate the diameter
problem in the streaming and sliding-window
models. We show that, for
a stream of $n$ points or a sliding window of size $n$, any exact
algorithm for diameter requires $\Omega(n)$ bits of
space. We present a simple $\epsilon$-approximation algorithm
for computing the diameter in the streaming model. Our main result
is an $\epsilon$-approximation algorithm
that maintains the diameter in two dimensions in the sliding-window
model using $O(({1}/{\epsilon^{3/2}}) \log^{3}n(\log R+\log\log n +
\log ({1}/{\epsilon})))$ bits of space, where $R$ is the maximum, over all
windows, of the ratio of the diameter to the minimum non-zero distance
between any two points in the window.
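For the plain streaming model, the "simple ε-approximation" can be realized by tracking extreme projections along a fixed set of directions; the sketch below assumes that standard technique (the paper's sliding-window structure is considerably more involved):

```python
import math

def approx_diameter_stream(points, eps):
    """One-pass diameter approximation for a 2-D point stream: track the
    extreme projections along k ~ O(1/sqrt(eps)) fixed directions.
    The widest projection never exceeds the true diameter and misses it
    by at most a factor cos(pi / (2k)), so the answer is within a
    (1 + eps) factor for small eps. (A standard streaming technique,
    used here as an illustrative stand-in for the paper's algorithm.)"""
    k = max(1, math.ceil(math.pi / (2 * math.sqrt(eps))))
    dirs = [(math.cos(i * math.pi / k), math.sin(i * math.pi / k)) for i in range(k)]
    lo = [math.inf] * k
    hi = [-math.inf] * k
    n = 0
    for x, y in points:                 # single pass, O(k) memory words
        n += 1
        for i, (dx, dy) in enumerate(dirs):
            p = x * dx + y * dy
            if p < lo[i]:
                lo[i] = p
            if p > hi[i]:
                hi[i] = p
    if n == 0:
        return 0.0
    return max(hi[i] - lo[i] for i in range(k))
```

Note the contrast with the Ω(n) lower bound for exact computation: the approximation keeps only O(1/√ε) scalars, independent of the stream length.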
3.
Mehdi Sharifzadeh, Cyrus Shahabi. The VLDB Journal: The International Journal on Very Large Data Bases, 2009, 18(1): 57-75
Several studies have exploited the properties of Voronoi diagrams to improve the efficiency of variations of nearest-neighbor search on stored datasets. However, the significance of Voronoi diagrams and their basic building blocks, Voronoi cells, has been neglected when the geometry data arrives incrementally as a data stream. In this paper, we study the problem of Voronoi cell computation for fixed 2-d site points when the locations of the neighboring sites arrive as a spatial data stream. We show that the non-streaming solution to the problem does not meet the memory constraints of many realistic scenarios over a sliding window. Hence, we propose AVC-SW, an approximate streaming algorithm that computes a (1 + ε)-approximation to the exact Voronoi cell in O(κ) time, where κ is its sample size. With the sliding-window model and random arrival of points, we show both analytically and experimentally that, for a given window size w and parameter k, AVC-SW reduces the expected memory requirements of the classic algorithm from O(w) to a bound that grows with k rather than w, regardless of the distribution of the points in the 2-d space. This is a significant improvement for most real-world scenarios, where w ≫ k.
4.
We study the problem of maintaining a sketch of recent elements of a data stream. Motivated by applications involving network
data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent
elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches
of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the
sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and
the time for inserting new elements into the sketch are all polylogarithmic with respect to the maximum window size. Our sketches
can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous
works on sketching recent elements of a data stream have all considered the more restrictive scenario of synchronous streams,
where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements
is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony.
The work of the authors was supported in part through NSF grants CNS 0520102 and CNS 0520009.
A preliminary version of this paper appeared in Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC)
2006, pages 82–91.
Work done while the third author was at Rensselaer Polytechnic Institute.
Authors are listed in reverse alphabetical order.
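The mergeability emphasized above can be illustrated with a far cruder sketch than the paper's: bucket values by a coarse timestamp, which makes insertion order-irrelevant (tolerating asynchrony) and lets two sketches combine losslessly by adding buckets. All names below are illustrative, and the error behavior at the window boundary is much weaker than the paper's polylogarithmic guarantees:

```python
from collections import defaultdict

class TimestampWindowSum:
    """Sum over a sliding timestamp window that tolerates out-of-order
    (asynchronous) arrivals. Items are bucketed by coarse timestamp, so
    insertion is order-independent, and two sketches merge losslessly by
    adding buckets. Boundary buckets contribute error up to their whole
    contents; the paper's sketches give provable estimates instead."""

    def __init__(self, granularity):
        self.g = granularity
        self.buckets = defaultdict(int)          # bucket index -> partial sum

    def insert(self, timestamp, value):
        self.buckets[timestamp // self.g] += value   # order does not matter

    def query(self, now, window):
        first = (now - window) // self.g + 1         # buckets fully inside the window
        last = now // self.g
        return sum(v for b, v in self.buckets.items() if first <= b <= last)

    def merge(self, other):
        out = TimestampWindowSum(self.g)
        out.buckets = defaultdict(int, self.buckets)
        for b, v in other.buckets.items():
            out.buckets[b] += v
        return out
```

The lossless `merge` is what makes such sketches suitable for distributed computation: each site summarizes its own stream, and a coordinator combines the summaries.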
6.
Sliding-Window-Based Compression Techniques and Continuous Query Processing for Data Streams (total citations: 8; self-citations: 0; by others: 8)
Continuous query processing based on sliding windows is a hot topic in data stream research. Existing work assumes that all the data in a sliding window can be kept in main memory; when the volume of data in the window exceeds the available memory, existing query processing methods no longer work. This paper proposes two sliding-window compression techniques for data streams that effectively reduce the storage requirements of sliding windows, and presents continuous query processing algorithms over compressed sliding windows. Theoretical analysis and experimental results show that these algorithms perform well and meet the real-time requirements of continuous query processing over data streams.
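The abstract does not specify the two compression schemes; as a minimal illustration of the idea, here is a hypothetical run-length-encoded count-based window that keeps a continuous SUM query up to date incrementally:

```python
from collections import deque

class CompressedWindow:
    """Count-based sliding window stored run-length encoded, with a
    continuously maintained SUM. Purely illustrative: the paper's two
    compression techniques are not described in the abstract."""

    def __init__(self, size):
        self.size = size
        self.runs = deque()    # (value, count) runs, oldest at the left
        self.n = 0             # logical number of items in the window
        self.total = 0         # maintained sum of the window

    def push(self, v):
        if self.runs and self.runs[-1][0] == v:
            self.runs[-1] = (v, self.runs[-1][1] + 1)   # extend the newest run
        else:
            self.runs.append((v, 1))
        self.n += 1
        self.total += v
        if self.n > self.size:                          # evict the oldest item
            ov, oc = self.runs[0]
            if oc == 1:
                self.runs.popleft()
            else:
                self.runs[0] = (ov, oc - 1)
            self.n -= 1
            self.total -= ov

    def sum(self):
        return self.total
```

On streams with long runs of repeated values, the number of stored runs can be far smaller than the logical window size, which is the kind of saving the paper targets.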
7.
Lin X., Xu J., Zhang Q., Hongjun Lu, Jeffrey Xu Yu, Zhou X., Yuan Y. IEEE Transactions on Knowledge and Data Engineering, 2006, 18(5): 683-698
Quantile computation has many applications, including data mining and financial data analysis. It has been shown that an ε-approximate summary can be maintained so that, given a quantile query (φ, ε), the data item at rank ⌈φN⌉ may be approximately obtained within rank error precision εN over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different φ and ε poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, we first aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users are retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times clusters are reprocessed and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
8.
Continuous Nearest Neighbor Queries over Sliding Windows (total citations: 1; self-citations: 0; by others: 1)
IEEE Transactions on Knowledge and Data Engineering, 2007, 19(6): 789-803
This paper studies continuous monitoring of nearest neighbor (NN) queries over sliding window streams. According to this model, data points continuously stream in the system, and they are considered valid only while they belong to a sliding window that contains 1) the W most recent arrivals (count-based) or 2) the arrivals within a fixed interval W covering the most recent time stamps (time-based). The task of the query processor is to constantly maintain the result of long-running NN queries among the valid data. We present two processing techniques that apply to both count-based and time-based windows. The first one adapts conceptual partitioning, the best existing method for continuous NN monitoring over update streams, to the sliding window model. The second technique reduces the problem to skyline maintenance in the distance-time space and precomputes the future changes in the NN set. We analyze the performance of both algorithms and extend them to variations of NN search. Finally, we compare their efficiency through a comprehensive experimental evaluation. The skyline-based algorithm achieves lower CPU cost, at the expense of slightly larger space overhead.
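The skyline reduction can be illustrated on a count-based window: in (arrival time, distance) space, a point can still become the NN only if no other point is both newer and at least as close, so it suffices to keep that skyline. A toy sketch under this simplification (the paper also handles time-based windows and precomputes future NN changes):

```python
import math
from collections import deque

def nn_skyline_stream(points, query, window):
    """Continuous nearest neighbor over a count-based sliding window.
    Only skyline points in (arrival, distance) space are kept: a point is
    discarded as soon as some newer point is at least as close, since it
    can then never again be the NN. The front of the deque is always the
    current NN among valid points."""
    dq = deque()           # (arrival, distance), distances increase front to back
    results = []
    for t, p in enumerate(points):
        d = math.dist(p, query)
        while dq and dq[-1][1] >= d:   # dominated by the newer, closer point
            dq.pop()
        dq.append((t, d))
        while dq[0][0] <= t - window:  # expired entries leave the window
            dq.popleft()
        results.append(dq[0][1])       # current NN distance
    return results
```

Because kept distances increase from front to back while arrivals increase too, every future NN change (an expiry promoting the next skyline point) is already materialized in the deque, which is the essence of the precomputation the abstract describes.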
10.
We consider the problem of indexing a set of objects moving in d-dimensional space along linear trajectories. A simple external-memory indexing scheme is proposed to efficiently answer general range queries. The following are examples of the queries that can be answered by the proposed method: report all moving objects that will (i) pass between two given points within a specified time interval; (ii) come within a given distance of some or all of a given set of other moving objects. Our scheme is based on mapping the objects to a dual space, where queries about moving objects are transformed into polyhedral queries concerning their speeds and initial locations. We then present a simple method for answering such polyhedral queries, based on partitioning the space into disjoint regions and using a B+-tree to index the points in each region. By appropriately selecting the boundaries of each region, we guarantee an average search time that matches a known lower bound for the problem. Specifically, for a fixed d, if the coordinates of a given set of N points are statistically independent, the proposed technique answers polyhedral queries, on average, in O((N/B)^(1-1/d) · (log_B N)^(1/d) + K/B) I/O's using O(N/B) space, where B is the block size and K is the number of reported points. Our approach is novel in that, while it provides a theoretical upper bound on the average query time, it avoids the use of complicated data structures, making it an effective candidate for practical applications. The proposed index is also dynamic in the sense that it allows object insertion and deletion at an amortized update cost of O(log_B N) I/O's. Experimental results are presented to show the superiority of the proposed index over other methods based on R-trees.
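The dual-space mapping can be made concrete in one dimension. Under linear motion x(t) = x0 + v·t, asking which objects cross a spatial interval during a time interval is a membership test against a convex polygon in the dual (v, x0) plane; the sketch below (all names hypothetical) checks that condition directly, while the paper avoids the linear scan by partitioning the dual space and indexing each region with a B+-tree:

```python
def passes_through(x0, v, x_lo, x_hi, t_lo, t_hi):
    """Does the 1-D trajectory x(t) = x0 + v*t enter [x_lo, x_hi] during
    [t_lo, t_hi]? In the dual (v, x0) plane the qualifying objects form a
    convex polygon bounded by the lines x0 + v*t = x_bound for
    t in {t_lo, t_hi}, so this is the 'polyhedral query' on dual points
    (a simplified 1-D illustration)."""
    # positions at the two endpoints of the time interval; under linear
    # motion, all positions in between form the interval [min, max]
    a, b = x0 + v * t_lo, x0 + v * t_hi
    return max(min(a, b), x_lo) <= min(max(a, b), x_hi)

# Brute-force scan over the dual points; the paper instead buckets the dual
# space into disjoint regions, each indexed by a B+-tree, to avoid this scan.
objs = [(0.0, 1.0), (10.0, -2.0), (5.0, 0.0)]    # (initial position x0, speed v)
hits = [i for i, (x0, v) in enumerate(objs)
        if passes_through(x0, v, 3.0, 6.0, 1.0, 2.0)]
```
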
11.
In this paper, we present two linear-size external-memory data structures for approximate range searching. Our first structure, the BAR-B-tree, stores a set of N points in R^d and can report all points inside a query range Q by accessing O(log_B N + ε^γ + k_ε/B) disk blocks, where B is the disk block size, γ = 1 - d for convex queries and γ = -d otherwise, and k_ε is the number of points lying within a distance of ε · diam(Q) of the query range Q. Our second structure, the object-BAR-B-tree, is able to store objects of arbitrary shape and constant complexity and provides similar query guarantees. In addition, both structures can support other types of range-searching queries, such as range aggregation and nearest-neighbor queries. Finally, we present I/O-efficient algorithms to build these structures.
12.
13.
Ricardo H. Nochetto. Calcolo, 1985, 22(4): 501-534
In this paper we derive L^2-error estimates for multidimensional two-phase Stefan problems involving further non-linearities such as non-linear flux conditions. We approximate the enthalpy formulation by a regularization procedure combined with a C^0-piecewise-linear finite element scheme in space and the implicit Euler scheme in time. Under some restrictions on the initial datum and on the finite element mesh, we obtain an L^2-rate of convergence of order ε^(1/2) for the regularized problems, and an L^2-rate of convergence of order (h|log h|^k + τ + ε)^(1/2) for the discrete problems (k = 0, 1). These estimates lead to the choice h ~ ε ~ τ and yield a global L^2-rate essentially of order h^(1/2). The a priori relationship between the approximation parameters allows the discrete problem to satisfy a maximum principle (without a lumping procedure), from which a priori bounds and monotonicity properties of the discrete solutions are shown.
This work was supported by the "Consejo Nacional de Investigaciones Científicas y Técnicas de la República Argentina".
14.
A randomized learning algorithm POLLY is presented that efficiently learns intersections of s halfspaces in n dimensions, in time polynomial in both s and n. The learning protocol is the PAC (probably approximately correct) model of Valiant, augmented with membership queries. In particular, POLLY receives a set S of m = poly(n, s, 1/ε, 1/δ) randomly generated points from an arbitrary distribution over the unit hypercube, and is told exactly which points are contained in, and which points are not contained in, the convex polyhedron P defined by the halfspaces. POLLY may also obtain the same information about points of its own choosing. It is shown that after poly(n, s, 1/ε, 1/δ, log(1/d)) time, the probability that POLLY fails to output a collection of s halfspaces with classification error at most ε is at most δ. Here, d is the minimum distance between the boundary of the target and those examples in S that do not lie on the boundary. The parameter log(1/d) can be bounded by the number of bits needed to encode the coefficients of the bounding hyperplanes and the coordinates of the sampled examples S. Moreover, POLLY can be extended to learn unions of k disjoint polyhedra, each with at most s facets, in time poly(n, k, s, 1/ε, 1/δ, log(1/d), 1/γ), where γ is the minimum distance between any two distinct polyhedra.
Received February 5, 1997; revised July 1, 1997.
15.
We present approximation algorithms for two closely related bicriteria problems. Given a graph with two weight functions on the edges, length and cost, we consider the Bounded-Diameter Minimum-Cost Steiner Tree (BDMST) problem and the Bounded-Diameter Minimum-Cardinality Edge Addition (BDMC) problem. We present a parameterized approximation algorithm for the BDMST problem with a bicriteria approximation ratio of (O(p log s / log p), O(log s / log p)), where the first factor gives the relaxation on the diameter bound, the second factor is the cost-approximation factor, s is the number of required nodes, and p, 1 ≤ p < s, is an input parameter. The parameter p allows a trade-off between the two approximation factors. This is the first improvement of the cost-approximation factor since the scheme proposed by Marathe et al. [9]. For example, p can be set to s^α to obtain an (O(s^α), O(1)) approximation for a constant α. The algorithm is very simple and is suitable for distributed implementations. We also present the first algorithm for Bounded-Hops Minimum-Cost Steiner Tree for complete graphs with the triangle inequality. This model includes graphs defined by points in a Euclidean space of any dimension. The algorithm guarantees an approximation ratio of (O(log_d s), O(log_d s)), where d is the bound on the diameter. This is an improvement over the general-case approximation when d is comparable with s. For example, the ratio is (O(1), O(1)) for any d = s^α, where α is a constant between 0 and 1. For the case where the number of terminals is a constant and all edge lengths are unit, we have a polynomial-time algorithm. This can be extended to any length function, providing a (1 + ε) factor in the approximation, with ε showing up in the time complexity of the algorithm. For another special case, where the cost of any edge is either 1 or 0 and the length of each edge is positive, an algorithm with approximation ratio (O(log(c log s)), O(log(c log s))) is presented, where c is the cost of the optimal solution. This is a significant improvement over (O(log s), O(log s)) when the cost of the solution c is much smaller than the number of terminals s. This is useful when an existing multicast network is to be modified to accommodate new terminals with the addition of as few new edges as possible. We also propose an approximation algorithm for the Bounded-Diameter Minimum-Cardinality Edge Addition problem, which achieves an O(log n) approximation while relaxing the diameter bound by 2. While this ratio is the same as the one claimed in [3], our algorithm is simple and combinatorial rather than based on a linear-program solution, and can be readily modified for a distributed implementation.
16.
Combinatorial property testing deals with the following relaxation of decision problems: given a fixed property and an input x, one wants to decide whether x satisfies the property or is "far" from satisfying it. The main focus of property testing is identifying large families of properties that can be tested with a certain number of queries to the input. In this paper we study the relation between the space complexity of a language and its query complexity. Our main result is that for any space complexity s(n) ≤ log n, there is a language with space complexity O(s(n)) and query complexity 2^(Ω(s(n))).
Our result has implications for testing languages accepted by certain restricted machines. Alon et al. [FOCS 1999] showed that any regular language is testable with a constant number of queries. It is well known that any language in space o(log log n) is regular, implying that such languages can be so tested. It was previously known that there are languages in space O(log n) that are not testable with a constant number of queries, and Newman [FOCS 2000] raised the question of closing the exponential gap between these two results. A special case of our main result resolves this problem, as it implies that there is a language in space O(log log n) that is not testable with a constant number of queries. It was also previously known that the class of testable properties cannot be extended to all context-free languages. We further show that one cannot even extend the family of testable languages to the class of languages accepted by single-counter machines.
17.
In data stream applications, systems often need to process large numbers of continuous sliding-window queries. Sharing sliding windows across queries can save storage space and improve the system's overall query processing capacity, but it increases the response latency of individual queries and degrades their quality of service. To address this problem, this paper introduces the concept of weighted shared sliding windows and proposes three optimized join execution algorithms that respond first to the more important sliding-window queries, thereby improving the system's overall quality of service. Theoretical analysis and experimental results show that the proposed methods are effective.
18.
The sliding window is an effective technique for mining the data seen in the most recent period. This paper proposes a sliding-window-based algorithm for mining frequent items over data streams. The algorithm adopts a linked-list queue strategy that greatly simplifies the algorithm and improves mining efficiency. For a given threshold S, error bound ε, and window length n, the algorithm detects the frequent items whose frequency within the window exceeds Sn, with error at most εn. The space complexity of the algorithm is O(1/ε), and both the per-item processing time and the query time are O(1). We further extend the algorithm so that, by varying its parameters, a family of frequent-item mining algorithms is obtained, allowing the time and space complexity to be traded off against each other. Extensive experiments show that the algorithm achieves better accuracy and better time and space efficiency than comparable algorithms.
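The paper's linked-list-queue algorithm is not reproduced in the abstract; as a point of comparison with the same O(1/ε)-space, εn-error flavor, a classic (non-windowed) Misra-Gries summary looks like this:

```python
def misra_gries(stream, eps):
    """Misra-Gries frequent-items summary: roughly 1/eps counters, every
    item with true frequency above eps*n survives, and each surviving
    count underestimates the true count by at most about eps*n.
    (A non-windowed baseline; the paper adapts this kind of guarantee to
    a sliding window.)"""
    k = max(1, int(1 / eps))    # number of counters kept
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping any that reach zero;
            # this charges one occurrence of x against k stored occurrences
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Each decrement step cancels k + 1 stream occurrences at once, which is why at most n/(k + 1) ≈ εn of any item's count can be lost.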
19.
To improve the ability to evaluate many continuous extremum queries (MAX or MIN) over the same data stream simultaneously, we study resource sharing among queries. We propose a pruning strategy called the "critical point set": the system needs to retain only a small amount of data to serve all queries. Exploiting the similarity among queries and the computation and storage they can share, we propose MCEQP, a multi-extremum query processing algorithm. Implemented with a linked-list structure, MCEQP updates the results of all K queries in at most O(M+K) time when a new data item arrives, where M is the number of data items in the critical point set. MCEQP is trigger-driven: changes to query results caused by data expiration need to be computed only at certain specific moments, and updating the K query results then takes O(K) time. Theoretical analysis and experiments show that, for multiple extremum queries over sliding-window data streams, MCEQP outperforms existing general-purpose methods in both storage overhead and performance.
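The "critical point set" resembles the classic monotonic structure for sliding-window maxima: keep only elements not dominated by anything newer and larger. A simplified sketch (not the paper's MCEQP) showing how one shared set can serve several MAX queries with different window sizes:

```python
import bisect

class SharedMax:
    """One shared 'critical point set' serving MAX queries with different
    window sizes over the same stream. The set holds (arrival, value)
    pairs with arrivals increasing and values strictly decreasing: any
    element with a newer, at-least-as-large successor can never again be
    a window maximum. (A simplified illustration in the spirit of the
    paper's shared set, not its MCEQP algorithm.)"""

    def __init__(self, max_window):
        self.max_window = max_window   # largest window any query may ask for
        self.t = 0
        self.crit = []                 # critical points, oldest first

    def insert(self, value):
        self.t += 1
        while self.crit and self.crit[-1][1] <= value:   # dominated points drop out
            self.crit.pop()
        self.crit.append((self.t, value))
        while self.crit and self.crit[0][0] <= self.t - self.max_window:
            self.crit.pop(0)                             # expired for every query

    def query_max(self, window):
        """MAX over the last `window` items (requires window <= max_window).
        The answer is the first critical point still inside this window."""
        cutoff = self.t - window
        arrivals = [a for a, _ in self.crit]
        i = bisect.bisect_right(arrivals, cutoff)
        return self.crit[i][1] if i < len(self.crit) else None
```

Because values in the set decrease with recency, every registered MAX query, whatever its window size, is answered from the same M critical points, which is the sharing the abstract describes.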