Similar Documents
20 similar documents found.
1.
Distributed Streams Algorithms for Sliding Windows
Massive data sets often arise as physically distributed, parallel data streams, and it is important to estimate various aggregates and statistics on the union of these streams. This paper presents algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams. Our results include: 1. For a single stream, we present the first ε-approximation scheme for the number of 1’s in a sliding window that is optimal in both worst-case time and space. We also present the first ε-approximation scheme for the sum of integers in [0..R] in a sliding window that is optimal in both worst-case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words. 2. In contrast, we show that any deterministic algorithm that estimates, to within a small constant relative error, the number of 1’s (or the sum of integers) in a sliding window on the union of distributed streams requires Ω(N) space. 3. We present the first (randomized) (ε, δ)-approximation scheme for the number of 1’s in a sliding window on the union of distributed streams that uses only logarithmic memory words. We also present the first (ε, δ)-approximation scheme for the number of distinct values in a sliding window on distributed streams that uses only logarithmic memory words. Our results are obtained using a novel family of synopsis data structures called waves.
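The paper's "waves" synopsis is not reproduced here. As an illustration of deterministic ε-approximate basic counting over a count-based sliding window, the following sketch uses an exponential-histogram-style structure in the spirit of Datar et al.; the class name and parameter choices are illustrative rather than taken from the paper.

```python
from collections import deque

class BasicCounter:
    """Approximate count of 1's among the last `window` items of a 0/1 stream.

    Exponential-histogram-style sketch: buckets hold power-of-two counts of 1's
    and at most k//2 + 1 buckets share a size, giving relative error of roughly
    1/k (set k = ceil(1/eps)).  Illustrative only; not the paper's "waves".
    """
    def __init__(self, window, eps):
        self.window = window
        self.k = max(2, int(1 / eps + 0.999))    # buckets allowed per size before merging
        self.t = 0                               # index of the current item
        self.buckets = deque()                   # (timestamp of most recent 1, size), oldest first

    def add(self, bit):
        self.t += 1
        # expire buckets whose most recent 1 fell out of the window
        while self.buckets and self.buckets[0][0] <= self.t - self.window:
            self.buckets.popleft()
        if bit != 1:
            return
        self.buckets.append((self.t, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) <= self.k // 2 + 1:
                break
            i, j = same[0], same[1]              # two oldest buckets of this size
            ts = self.buckets[j][0]              # merged bucket keeps the newer timestamp
            self.buckets[j] = (ts, 2 * size)
            del self.buckets[i]
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        oldest = self.buckets[0][1]
        return total - oldest // 2               # half-credit for the straddling bucket
```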

2.
We investigate the diameter problem in the streaming and sliding-window models. We show that, for a stream of $n$ points or a sliding window of size $n$, any exact algorithm for diameter requires $\Omega(n)$ bits of space. We present a simple $\epsilon$-approximation algorithm for computing the diameter in the streaming model. Our main result is an $\epsilon$-approximation algorithm that maintains the diameter in two dimensions in the sliding-window model using $O(({1}/{\epsilon^{3/2}}) \log^{3}n(\log R+\log\log n + \log ({1}/{\epsilon})))$ bits of space, where $R$ is the maximum, over all windows, of the ratio of the diameter to the minimum non-zero distance between any two points in the window.
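For intuition, a common (1 + ε)-approximation of the diameter in the insert-only streaming model keeps, for each of O(1/√ε) fixed directions, only the extreme projections seen so far; the widest projected extent approximates the diameter. The minimal sketch below shows that folklore technique only — it is not the paper's sliding-window algorithm and does not handle expiring points.

```python
import math

class StreamingDiameter2D:
    """Approximate diameter of all points seen so far (insert-only stream).

    Keeps the min/max projection onto m ~ pi / (2*sqrt(eps)) directions;
    the largest projected extent is a (1 + eps)-approximation of the diameter.
    Illustrative sketch only -- no support for sliding-window expiry.
    """
    def __init__(self, eps):
        m = max(1, math.ceil(math.pi / (2 * math.sqrt(eps))))
        self.dirs = [(math.cos(i * math.pi / m), math.sin(i * math.pi / m)) for i in range(m)]
        self.lo = [math.inf] * m
        self.hi = [-math.inf] * m

    def add(self, x, y):
        for i, (dx, dy) in enumerate(self.dirs):
            p = x * dx + y * dy                  # projection onto direction i
            self.lo[i] = min(self.lo[i], p)
            self.hi[i] = max(self.hi[i], p)

    def estimate(self):
        return max((h - l for l, h in zip(self.lo, self.hi) if h >= l), default=0.0)
```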

3.
Several studies have exploited the properties of Voronoi diagrams to improve the efficiency of variations of the nearest neighbor search on stored datasets. However, the significance of Voronoi diagrams and their basic building blocks, Voronoi cells, has been neglected when the geometry data is incrementally becoming available as a data stream. In this paper, we study the problem of Voronoi cell computation for fixed 2-d site points when the locations of the neighboring sites arrive as a spatial data stream. We show that the non-streaming solution to the problem does not meet the memory requirements of many realistic scenarios over a sliding window. Hence, we propose AVC-SW, an approximate streaming algorithm that computes (1 + ε)-approximations to the exact Voronoi cell in O(κ), where κ is its sample size. With the sliding window model and random arrival of points, we show both analytically and experimentally that, for a given window size w and parameter k, AVC-SW reduces the expected memory requirements of the classic algorithm well below its O(w) bound, regardless of the distribution of the points in the 2-d space. This is a significant improvement for most real-world scenarios, where w ≫ k.
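AVC-SW's sampling scheme itself is not shown here. The sketch below only illustrates the underlying geometric step: the Voronoi cell of a fixed site is the intersection of the half-planes bounded by the perpendicular bisectors between the site and its (sampled) neighboring sites, computed by clipping a large bounding box. Function names and the bounding-box size are illustrative assumptions.

```python
def bisector_halfplane(site, neighbor):
    """Half-plane {x : a.x <= b} of points at least as close to `site` as to `neighbor`."""
    (sx, sy), (nx, ny) = site, neighbor
    a = (nx - sx, ny - sy)
    b = ((nx * nx + ny * ny) - (sx * sx + sy * sy)) / 2.0
    return a, b

def clip(polygon, halfplane):
    """Sutherland-Hodgman clip of a convex polygon by the constraint a.x <= b."""
    (ax, ay), b = halfplane
    inside = lambda p: ax * p[0] + ay * p[1] <= b + 1e-12
    out, n = [], len(polygon)
    for i in range(n):
        p, q = polygon[i], polygon[(i + 1) % n]
        fp, fq = inside(p), inside(q)
        if fp:
            out.append(p)
        if fp != fq:  # the edge crosses the bisector: add the intersection point
            t = (b - ax * p[0] - ay * p[1]) / (ax * (q[0] - p[0]) + ay * (q[1] - p[1]))
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def approx_voronoi_cell(site, neighbor_sample, bound=1e6):
    """Voronoi cell of `site` with respect to a sample of neighbors, clipped to a large box."""
    cell = [(-bound, -bound), (bound, -bound), (bound, bound), (-bound, bound)]
    for nb in neighbor_sample:
        cell = clip(cell, bisector_halfplane(site, nb))
        if not cell:
            break
    return cell
```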

4.
We study the problem of maintaining a sketch of recent elements of a data stream. Motivated by applications involving network data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and the time for inserting new elements into the sketch are all polylogarithmic with respect to the maximum window size. Our sketches can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous works on sketching recent elements of a data stream have all considered the more restrictive scenario of synchronous streams, where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony. The work of the authors was supported in part through NSF grants CNS 0520102 and CNS 0520009. A preliminary version of this paper appeared in Proceedings of the ACM Symposium on Principles of Distributed Computing (PODC) 2006, pages 82–91. Work done while the third author was at Rensselaer Polytechnic Institute. Authors are listed in reverse alphabetical order.

5.
We investigate the diameter problem in the streaming and sliding-window models. We show that, for a stream of n points or a sliding window of size n, any exact algorithm for diameter requires Ω(n) bits of space. We present a simple ε-approximation algorithm for computing the diameter in the streaming model. Our main result is an ε-approximation algorithm that maintains the diameter in two dimensions in the sliding-window model using O((1/ε^{3/2}) log^3 n (log R + log log n + log(1/ε))) bits of space, where R is the maximum, over all windows, of the ratio of the diameter to the minimum non-zero distance between any two points in the window.

6.
Sliding-Window-Based Data Stream Compression and Continuous Query Processing
Continuous query processing over sliding windows is a hot topic in data stream research. Existing work assumes that all the data in the sliding window fits in main memory; if the volume of data in the window exceeds the available main memory, existing query processing methods no longer work. This paper proposes two compression techniques for sliding windows over data streams that effectively reduce the storage requirement of the sliding window, together with continuous query processing algorithms that operate directly on the compressed sliding windows. Theoretical analysis and experimental results show that these algorithms perform well and meet the real-time requirements of continuous query processing over data streams.

7.
Quantile computation has many applications including data mining and financial data analysis. It has been shown that an ε-approximate summary can be maintained so that, given a quantile query (φ, ε), the data item at rank ⌈φN⌉ may be approximately obtained within the rank error precision εN over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different φ and ε poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.
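As a toy illustration of the clustering idea only (not the paper's algorithm): each query (φ_i, ε_i) over N items accepts any answer whose rank lies in [⌈φ_i N⌉ − ε_i N, ⌈φ_i N⌉ + ε_i N], so queries whose intervals share a common rank can be served by a single reported item. The greedy right-endpoint sweep below computes the minimum number of shared answer ranks; the function name is hypothetical.

```python
import math

def cluster_quantile_queries(queries, n):
    """Group quantile queries (phi, eps) so that one answered rank satisfies each group.

    Each query tolerates any item whose rank lies in
    [ceil(phi*n) - eps*n, ceil(phi*n) + eps*n]; the classic greedy
    interval-stabbing sweep (sort by right endpoint) picks the fewest shared ranks.
    Returns a list of (rank, [queries served by that rank]).
    """
    intervals = []
    for phi, eps in queries:
        r = math.ceil(phi * n)
        intervals.append((r - eps * n, r + eps * n, (phi, eps)))
    intervals.sort(key=lambda iv: iv[1])          # by right endpoint
    clusters, current_rank = [], None
    for lo, hi, q in intervals:
        if current_rank is None or lo > current_rank:
            current_rank = hi                     # open a new cluster at this right endpoint
            clusters.append((current_rank, []))
        clusters[-1][1].append(q)
    return clusters
```

Each cluster is then answered once from the shared ε-approximate summary, and that single result is reused for every query in the cluster.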

8.
Continuous Nearest Neighbor Queries over Sliding Windows
This paper studies continuous monitoring of nearest neighbor (NN) queries over sliding window streams. According to this model, data points continuously stream in the system, and they are considered valid only while they belong to a sliding window that contains 1) the W most recent arrivals (count-based) or 2) the arrivals within a fixed interval W covering the most recent time stamps (time-based). The task of the query processor is to constantly maintain the result of long-running NN queries among the valid data. We present two processing techniques that apply to both count-based and time-based windows. The first one adapts conceptual partitioning, the best existing method for continuous NN monitoring over update streams, to the sliding window model. The second technique reduces the problem to skyline maintenance in the distance-time space and precomputes the future changes in the NN set. We analyze the performance of both algorithms and extend them to variations of NN search. Finally, we compare their efficiency through a comprehensive experimental evaluation. The skyline-based algorithm achieves lower CPU cost, at the expense of slightly larger space overhead.
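A minimal sketch of the distance-time skyline idea for a single query over a count-based window (not the paper's full algorithm, which also covers time-based windows and k > 1): a point can only ever become the NN if no other point is both newer and closer, so it suffices to keep the surviving candidates in arrival order with increasing distance; the front of that structure is the current NN, and the rest encode the future NN changes as points expire.

```python
from collections import deque

class SlidingWindowNN:
    """Continuous NN of a fixed query point over the W most recent stream points.

    Keeps only the distance-time skyline: points not dominated by a newer,
    closer point.  The deque holds (arrival_index, squared_distance, point)
    with arrival increasing and distance increasing, so the front is the
    current NN.  Sketch for a count-based window; not the paper's algorithm.
    """
    def __init__(self, query, window):
        self.qx, self.qy = query
        self.window = window
        self.t = 0
        self.skyline = deque()

    def insert(self, point):
        self.t += 1
        d = (point[0] - self.qx) ** 2 + (point[1] - self.qy) ** 2
        # drop older candidates that the new arrival dominates (older AND farther)
        while self.skyline and self.skyline[-1][1] >= d:
            self.skyline.pop()
        self.skyline.append((self.t, d, point))
        # expire the front candidate if it left the window
        while self.skyline and self.skyline[0][0] <= self.t - self.window:
            self.skyline.popleft()

    def nearest(self):
        return self.skyline[0][2] if self.skyline else None
```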

9.
Cover4     
This paper studies continuous monitoring of nearest neighbor (NN) queries over sliding window streams. According to this model, data points continuously stream in the system, and they are considered valid only while they belong to a sliding window that contains 1) the W most recent arrivals (count-based) or 2) the arrivals within a fixed interval W covering the most recent time stamps (time-based). The task of the query processor is to constantly maintain the result of long-running NN queries among the valid data. We present two processing techniques that apply to both count-based and time-based windows. The first one adapts conceptual partitioning, the best existing method for continuous NN monitoring over update streams, to the sliding window model. The second technique reduces the problem to skyline maintenance in the distance-time space and precomputes the future changes in the NN set. We analyze the performance of both algorithms and extend them to variations of NN search. Finally, we compare their efficiency through a comprehensive experimental evaluation. The skyline-based algorithm achieves lower CPU cost, at the expense of slightly larger space overhead.

10.
We consider the problem of indexing a set of objects moving in d-dimensional spaces along linear trajectories. A simple external-memory indexing scheme is proposed to efficiently answer general range queries. The following are examples of the queries that can be answered by the proposed method: report all moving objects that will (i) pass between two given points within a specified time interval; (ii) become within a given distance from some or all of a given set of other moving objects. Our scheme is based on mapping the objects to a dual space, where queries about moving objects are transformed into polyhedral queries concerning their speeds and initial locations. We then present a simple method for answering such polyhedral queries, based on partitioning the space into disjoint regions and using a B+-tree to index the points in each region. By appropriately selecting the boundaries of each region, we guarantee an average search time that matches a known lower bound for the problem. Specifically, for a fixed d, if the coordinates of a given set of N points are statistically independent, the proposed technique answers polyhedral queries, on the average, in O((N/B)^{1−1/d}·(log_B N)^{1/d} + K/B) I/O's using O(N/B) space, where B is the block size and K is the number of reported points. Our approach is novel in that, while it provides a theoretical upper bound on the average query time, it avoids the use of complicated data structures, making it an effective candidate for practical applications. The proposed index is also dynamic in the sense that it allows object insertion and deletion at an amortized update cost of O(log_B N) I/O's. Experimental results are presented to show the superiority of the proposed index over other methods based on R-trees. Recommended by Ahmed Elmagarmid.
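To make the dual-space idea concrete (the region partitioning and B+-tree are omitted), a 1-d object with linear motion x(t) = x0 + v·t maps to the dual point (v, x0), and a query "report objects inside [x_lo, x_hi] at some time in [t1, t2]" becomes a set of linear constraints on (v, x0). The brute-force check below is an illustrative stand-in; all names and the example data are assumptions.

```python
def report_range_query(duals, x_lo, x_hi, t1, t2):
    """Objects with x(t) = x0 + v*t that enter [x_lo, x_hi] during [t1, t2].

    `duals` maps object ids to their dual points (v, x0).  Because x(t) is
    linear, its range over [t1, t2] is [min(x(t1), x(t2)), max(x(t1), x(t2))];
    the object qualifies iff that range intersects [x_lo, x_hi].
    Brute-force stand-in for the paper's partitioned B+-tree index.
    """
    hits = []
    for oid, (v, x0) in duals.items():
        a, b = x0 + v * t1, x0 + v * t2
        if min(a, b) <= x_hi and max(a, b) >= x_lo:
            hits.append(oid)
    return hits

# example: three moving objects, indexed in dual space by (speed, initial location)
duals = {"o1": (2.0, 0.0), "o2": (-3.0, 10.0), "o3": (0.0, 100.0)}
print(report_range_query(duals, x_lo=3.0, x_hi=6.0, t1=1.0, t2=2.0))  # ['o1', 'o2']
```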

11.
In this paper, we present two linear-size external memory data structures for approximate range searching. Our first structure, the BAR-B-tree, stores a set of N points in ℝ^d and can report all points inside a query range Q by accessing O(log_B N + ε^γ + k_ε/B) disk blocks, where B is the disk block size, γ = 1−d for convex queries and γ = −d otherwise, and k_ε is the number of points lying within a distance of ε·diam(Q) of the query range Q. Our second structure, the object-BAR-B-tree, is able to store objects of arbitrary shapes of constant complexity and provides similar query guarantees. In addition, both structures can support other types of range searching queries such as range aggregation and nearest-neighbor. Finally, we present I/O-efficient algorithms to build these structures.

12.
Similarity queries are an important data mining application. Because data streams are unbounded and arrive at high speed, traditional query algorithms cannot be applied to them directly. This paper proposes a similarity query algorithm over multiple data streams based on wavelet sliding windows. The algorithm first partitions the sliding window into several equal-width basic windows, then applies a wavelet decomposition to the data in each basic window and prunes the coefficients, forming a wavelet-synopsis window. Similarity queries are evaluated directly on the wavelet synopses, without reconstructing the data. Because it exploits the linear-time property of the wavelet decomposition, the algorithm has low time complexity. Finally, experiments on real data demonstrate the algorithm's effectiveness.
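A minimal sketch of the general wavelet-synopsis approach (not the paper's exact algorithm): each basic window is run through an orthonormal Haar transform, only the largest-magnitude coefficients are kept, and the Euclidean distance between two basic windows is then approximated directly from the retained coefficients (the transform is orthonormal, so distances are preserved by Parseval's identity). The basic-window length is assumed to be a power of two; all names are illustrative.

```python
import math

def haar(values):
    """Orthonormal Haar transform of a sequence whose length is a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        nxt = coeffs[:]                            # work on a copy of the current level
        for i in range(half):
            a, b = coeffs[2 * i], coeffs[2 * i + 1]
            nxt[i] = (a + b) / math.sqrt(2)        # averages
            nxt[half + i] = (a - b) / math.sqrt(2) # details
        coeffs[:n] = nxt[:n]
        n = half
    return coeffs

def synopsis(basic_window, keep):
    """Keep the `keep` largest-magnitude Haar coefficients as {index: value}."""
    c = haar(basic_window)
    top = sorted(range(len(c)), key=lambda i: abs(c[i]), reverse=True)[:keep]
    return {i: c[i] for i in top}

def approx_distance(syn_a, syn_b):
    """Approximate Euclidean distance between two basic windows from their synopses."""
    idx = set(syn_a) | set(syn_b)
    return math.sqrt(sum((syn_a.get(i, 0.0) - syn_b.get(i, 0.0)) ** 2 for i in idx))
```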

13.
In this paper we derive L^2-error estimates for multidimensional two-phase Stefan problems involving further non-linearities such as non-linear flux conditions. We approximate the enthalpy formulation by a regularization procedure combined with a C^0-piecewise linear finite element scheme in space and the implicit Euler scheme in time. Under some restrictions on the initial datum and on the finite element mesh, we obtain an L^2-rate of convergence of order ε^{1/2} for the regularized problems, and an L^2-rate of convergence of order (h|log h|^k + τ + ε)^{1/2} for the discrete problems (k = 0, 1). These estimates lead to the choice h ∼ ε ∼ τ and yield a global L^2-rate essentially of order h^{1/2}. The a priori relationship between the approximation parameters allows the discrete problem to satisfy a maximum principle (without a lumping procedure), from which a priori bounds and monotonicity properties of the discrete solutions are shown. This work was supported by the “Consejo Nacional de Investigaciones Científicas y Técnicas de la República Argentina”.

14.
S. Kwek  L. Pitt 《Algorithmica》1998,22(1-2):53-75
A randomized learning algorithm POLLY is presented that efficiently learns intersections of s halfspaces in n dimensions, in time polynomial in both s and n. The learning protocol is the PAC (probably approximately correct) model of Valiant, augmented with membership queries. In particular, POLLY receives a set S of m = poly(n, s, 1/ε, 1/δ) randomly generated points from an arbitrary distribution over the unit hypercube, and is told exactly which points are contained in, and which points are not contained in, the convex polyhedron P defined by the halfspaces. POLLY may also obtain the same information about points of its own choosing. It is shown that after poly(n, s, 1/ε, 1/δ, log(1/d)) time, the probability that POLLY fails to output a collection of s halfspaces with classification error at most ε is at most δ. Here, d is the minimum distance between the boundary of the target and those examples in S that are not lying on the boundary. The parameter log(1/d) can be bounded by the number of bits needed to encode the coefficients of the bounding hyperplanes and the coordinates of the sampled examples S. Moreover, POLLY can be extended to learn unions of k disjoint polyhedra, with each polyhedron having at most s facets, in time poly(n, k, s, 1/ε, 1/δ, log(1/d), 1/γ), where γ is the minimum distance between any two distinct polyhedra. Received February 5, 1997; revised July 1, 1997.

15.
We present approximation algorithms for two closely related bicriteria problems. Given a graph with two weight functions on the edges, length and cost, we consider the Bounded-Diameter Minimum-Cost Steiner Tree (BDMST) problem and the Bounded-Diameter Minimum-Cardinality Edge Addition (BDMC) problem. We present a parameterized approximation algorithm for the BDMST problem with a bicriteria approximation ratio of (O(p log s/log p), O(log s/log p)), where the first factor gives the relaxation on the diameter bound, the second factor is the cost-approximation factor, s is the number of required nodes, and p, 1 ≤ p < s, is an input parameter. The parameter p allows a trade-off between the two approximation factors. This is the first improvement of the cost-approximation factor since the scheme proposed by Marathe et al. [9]. For example, p can be set to s^α to obtain an (O(s^α), O(1)) approximation for a constant α. The algorithm is very simple and is suitable for distributed implementations. We also present the first algorithm for Bounded-Hops Minimum-Cost Steiner Tree for complete graphs with triangle inequality. This model includes graphs defined by points in a Euclidean space of any dimension. The algorithm guarantees an approximation ratio of (O(log_d s), O(log_d s)), where d is the bound on the diameter. This is an improvement over the general-case approximation when d is comparable with s. For example, the ratio is (O(1), O(1)) for any d = s^α, where α is a constant between 0 and 1. For the case where the number of terminals is a constant and all edge lengths are unit, we have a polynomial-time algorithm. This can be extended to any length function, providing a (1 + ε) in the approximation with ε showing up in the time complexity of the algorithm. For another special case, where the cost of any edge is either 1 or 0 and the length of each edge is positive, an algorithm with approximation ratio of (O(log(c log s)), O(log(c log s))) is presented, where c is the cost of the optimal solution. This approximation is a significant improvement over (O(log s), O(log s)) when the cost of the solution c is much smaller than the number of terminals s. This is useful when an existing multicast network is to be modified to accommodate new terminals with the addition of as few new edges as possible. We also propose an approximation algorithm for the Bounded-Diameter Minimum-Cardinality Edge Addition problem, which achieves an O(log n) approximation while relaxing the diameter bound by 2. While this ratio is the same as the one claimed in [3], this algorithm is simple and combinatorial rather than based on the Linear Program solution and can be readily modified for a distributed implementation.

16.
Combinatorial property testing deals with the following relaxation of decision problems: given a fixed property and an input x, one wants to decide whether x satisfies the property or is “far” from satisfying it. The main focus of property testing is in identifying large families of properties that can be tested with a certain number of queries to the input. In this paper we study the relation between the space complexity of a language and its query complexity. Our main result is that for any space complexity s(n) ≤ log n there is a language with space complexity O(s(n)) and query complexity 2^{Ω(s(n))}. Our result has implications with respect to testing languages accepted by certain restricted machines. Alon et al. [FOCS 1999] have shown that any regular language is testable with a constant number of queries. It is well known that any language in space o(log log n) is regular, thus implying that such languages can be so tested. It was previously known that there are languages in space O(log n) that are not testable with a constant number of queries, and Newman [FOCS 2000] raised the question of closing the exponential gap between these two results. A special case of our main result resolves this problem, as it implies that there is a language in space O(log log n) that is not testable with a constant number of queries. It was also previously known that the class of testable properties cannot be extended to all context-free languages. We further show that one cannot even extend the family of testable languages to the class of languages accepted by single counter machines.

17.
In data stream applications, systems often have to process a large number of continuous sliding-window queries. Sharing sliding windows across queries saves storage space and improves the system's overall query processing capacity, but it also increases the response latency of individual queries and thus degrades their quality of service. To address this problem, this paper introduces the notion of weighted shared sliding windows and proposes three optimized join execution algorithms that give priority to the more important sliding-window queries, thereby improving the overall quality of service of the system. Theoretical analysis and experimental results show that the proposed methods are effective.

18.
The sliding window is an effective technique for mining the most recent data. This paper proposes a sliding-window-based algorithm for mining frequent items in data streams. The algorithm adopts a linked-list queue strategy that greatly simplifies the algorithm and improves mining efficiency. For a given threshold S, error ε and window length n, the algorithm detects the frequent items whose frequency within the window exceeds Sn, with error bounded by εn. The space complexity of the algorithm is O(1/ε), and both per-item processing time and query time are O(1). Building on this, we extend the algorithm so that, by varying its parameters, different frequent-item mining algorithms are obtained, allowing the time and space complexity to be traded off against each other. Extensive experiments show that the algorithm achieves better accuracy as well as better time and space efficiency than comparable algorithms.
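The paper's linked-list queue algorithm is not reproduced here. As a rough sketch of the same goal (frequent items over a sliding window with error on the order of εn), the version below approximates the window by fixed-size basic windows, summarizes each with a Misra–Gries summary of ⌈2/ε⌉ counters, and sums the summaries at query time; the class and parameter names are illustrative assumptions.

```python
import math
from collections import deque

def misra_gries(items, k):
    """Misra-Gries summary: each reported count under-counts by at most len(items)/k."""
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

class WindowedFrequentItems:
    """Frequent items over roughly the last n items of a stream.

    The window is approximated by its most recent n/b basic windows of size b,
    each summarized by a Misra-Gries summary with k = ceil(2/eps) counters;
    the summed estimates are then within about eps*n of the true windowed counts.
    A generic basic-window sketch, not the paper's linked-list algorithm.
    """
    def __init__(self, n, eps, b):
        assert n % b == 0
        self.b, self.k = b, math.ceil(2 / eps)
        self.summaries = deque(maxlen=n // b)   # one summary per completed basic window
        self.current = []                       # raw items of the basic window being filled

    def add(self, item):
        self.current.append(item)
        if len(self.current) == self.b:
            self.summaries.append(misra_gries(self.current, self.k))
            self.current = []

    def frequent(self, support):
        """Items whose estimated windowed frequency is at least support * n."""
        n = len(self.summaries) * self.b
        totals = {}
        for summary in self.summaries:
            for item, c in summary.items():
                totals[item] = totals.get(item, 0) + c
        return {item: c for item, c in totals.items() if c >= support * n}
```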

19.
To improve throughput when many continuous extreme-value queries (MAX or MIN) are evaluated simultaneously over the same data stream, this paper studies resource sharing across queries. It proposes a pruning strategy called the "critical point set", with which the system only needs to retain a small amount of data to satisfy all queries. By exploiting the similarity among queries and the computation and storage resources they can share, a multi-extreme-query processing algorithm, MCEQP, is proposed. Implemented with a linked-list structure, the algorithm updates the results of all K queries in at most O(M + K) time when a new item arrives, where M is the number of items in the critical point set. MCEQP is trigger-driven: the result changes caused by expired data only need to be computed at certain specific moments, and updating the K query results then takes O(K) time. Theoretical analysis and experiments show that, for multiple extreme-value queries over sliding-window data streams, MCEQP outperforms existing general-purpose methods in both storage overhead and performance.

20.
Approximate KNN Queries over Trajectory Data Streams
This paper proposes a sliding-window-based approximate K-nearest-neighbor (KNN) query algorithm. The data in the sliding window is partitioned by clustering into several basic windows of varying sizes; each basic window is assigned a sampling rate and its data is sampled in a biased manner to form a data stream synopsis. Based on this synopsis, a computational-geometry plane-sweep algorithm is used to execute distributed nearest-neighbor queries. Simulation results show that the algorithm is effective and scales well.
