Similar Documents
20 similar documents found (search time: 109 ms)
1.
The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of the similarity join are well-known, the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important, third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, postprocessing of sampling-based data mining, etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
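As a point of reference for the operation defined above, the following is a minimal brute-force sketch of the k-nearest-neighbour join in Python; it only illustrates the definition, and the MuX index with its loading and processing strategies is not reproduced.

```python
import heapq
import math

def knn_join(R, S, k):
    """Brute-force k-nearest-neighbour join: pair every point r of R with the
    k points of S closest to it (Euclidean distance)."""
    result = {}
    for i, r in enumerate(R):
        dists = ((math.dist(r, s), j) for j, s in enumerate(S))
        result[i] = heapq.nsmallest(k, dists)   # k smallest (distance, index) pairs
    return result

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
print(knn_join(R, S, k=2))
```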

2.
Due to the famous dimensionality curse problem, search in a high-dimensional space is considered a "hard" problem. In this paper, a novel composite distance transformation method, called CDT, is proposed to support fast k-nearest-neighbor (k-NN) search in high-dimensional spaces. In CDT, all n data points are first grouped into clusters by a k-means clustering algorithm. Then a composite distance key is computed for each data point. Finally, the index keys of the n data points are indexed with a partition-based B+-tree. Thus, given a query point, its k-NN search in the high-dimensional space is transformed, with the aid of the CDT index, into a search in a single-dimensional space. Extensive performance studies are conducted to evaluate the effectiveness and efficiency of the proposed scheme. Our results show that this method outperforms the state-of-the-art high-dimensional search techniques, such as the X-Tree, VA-file, iDistance and NB-Tree.
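A rough sketch of the general pipeline described above (cluster, derive a one-dimensional composite key, keep the keys sorted, answer k-NN queries over the key order); the key formula and the fixed probe window used here are illustrative simplifications, not the exact CDT construction or its partition-based B+-tree.

```python
import bisect
import math
import random

def kmeans(points, n_clusters, iters=10):
    """A tiny k-means, just enough to give every point a cluster centre."""
    centres = random.sample(points, n_clusters)
    for _ in range(iters):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            groups[nearest].append(p)
        centres = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    return centres

def composite_key(p, centres, stretch=1e6):
    """One number per point: cluster id in the high part, distance to that
    cluster's centre in the low part (an illustrative stand-in for the CDT key)."""
    cid = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
    return cid * stretch + math.dist(p, centres[cid])

def knn_query(q, keyed, centres, k, probe=50):
    """Locate the query's key in the sorted key list (stand-in for the
    partition-based B+-tree), take nearby entries as candidates, and refine
    them by true distance."""
    pos = bisect.bisect_left(keyed, (composite_key(q, centres),))
    candidates = [p for _, p in keyed[max(0, pos - probe): pos + probe]]
    return sorted(candidates, key=lambda p: math.dist(q, p))[:k]

points = [tuple(random.random() for _ in range(8)) for _ in range(500)]
centres = kmeans(points, n_clusters=5)
keyed = sorted((composite_key(p, centres), p) for p in points)
print(knn_query(points[0], keyed, centres, k=3))
```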

3.
徐剑  王安迪  毕猛  周福才 《软件学报》2019,30(11):3503-3517
The k-nearest neighbor (kNN) classifier is widely used in bioinformatics, stock prediction, web page classification, iris classification, and other applications. As users' awareness of privacy protection grows, the kNN classifier also needs to support classification over encrypted data in order to guarantee the privacy of user data, i.e., a privacy-preserving k-nearest neighbor classifier (PP-kNN) has to be designed. First, the operations of the kNN classifier are analyzed and a set of basic operations is extracted from them, including addition, multiplication, comparison, and inner product. Then, two homomorphic encryption schemes and one fully homomorphic encryption scheme are selected to encrypt the data. On this basis, secure protocols for the basic operations are designed whose outputs are identical to the outputs of the same operations executed on plaintext data, and the protocols are proved secure in the semi-honest model. Finally, by composing the secure protocols of the basic operations in a modular, sequential manner, the kNN classifier is enabled to process encrypted data. The proposed PP-kNN classifier is evaluated experimentally, and the results show that it classifies encrypted data with relatively high efficiency while providing privacy protection for user data.
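One of the basic operations listed above, an inner product over encrypted data, can be sketched with an additively homomorphic (Paillier) cryptosystem using the python-paillier library; this is only an assumed illustration of the building block, not the paper's protocols, which also require secure comparison and a fully homomorphic scheme.

```python
# pip install phe   (python-paillier, an additively homomorphic cryptosystem)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

user_vector = [3, 1, 4]                              # the user's private data
enc_vector = [public_key.encrypt(x) for x in user_vector]

# The server computes an inner product with its own plaintext vector without
# seeing the user's values: ciphertext + ciphertext and ciphertext * plaintext
# are exactly the operations an additively homomorphic scheme provides.
server_vector = [2, 0, 5]
enc_inner = enc_vector[0] * server_vector[0]
for c, w in zip(enc_vector[1:], server_vector[1:]):
    enc_inner = enc_inner + c * w

print(private_key.decrypt(enc_inner))                # 3*2 + 1*0 + 4*5 = 26
```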

4.
Text categorization is a significant technique to manage the surging text data on the Internet. The k-nearest neighbors (kNN) algorithm is an effective, but not efficient, classification model for text categorization. In this paper, we propose an effective strategy to accelerate the standard kNN, based on a simple principle: usually, near points in space are also near when they are projected onto a direction, which means that points that are distant in the projection direction are also distant in the original space. Using the proposed strategy, most of the irrelevant points can be removed when searching for the k-nearest neighbors of a query point, which greatly decreases the computation cost. Experimental results show that the proposed strategy greatly improves the time performance of the standard kNN, with little degradation in accuracy. Specifically, it is superior in applications that have large and high-dimensional datasets.
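One way the stated principle can be turned into a pruning rule is sketched below: points are pre-sorted by their projection onto a unit direction, and the search scans outward from the query's projection, stopping once the projection gap exceeds the current k-th best distance (the gap lower-bounds the true distance). The direction choice and data here are illustrative; the paper's actual filtering strategy may differ.

```python
import bisect
import heapq
import math
import random

def build(points, direction):
    """Pre-sort the points by their projection onto a unit direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    u = [d / norm for d in direction]
    proj = sorted((sum(p_i * u_i for p_i, u_i in zip(p, u)), p) for p in points)
    return u, proj

def knn(query, u, proj, k):
    """Scan outward from the query's projection; once the smaller of the two
    projection gaps reaches the current k-th best distance, no remaining point
    can be closer, so the scan stops."""
    q = sum(q_i * u_i for q_i, u_i in zip(query, u))
    pos = bisect.bisect_left(proj, (q,))
    heap = []                                    # max-heap of (-dist, point), size <= k
    left, right = pos - 1, pos
    while left >= 0 or right < len(proj):
        gl = q - proj[left][0] if left >= 0 else math.inf
        gr = proj[right][0] - q if right < len(proj) else math.inf
        gap, (_, p) = (gl, proj[left]) if gl <= gr else (gr, proj[right])
        if len(heap) == k and gap >= -heap[0][0]:
            break                                # projection gap already too large
        d = math.dist(query, p)
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, p))
        if gl <= gr:
            left -= 1
        else:
            right += 1
    return sorted((-nd, p) for nd, p in heap)

points = [tuple(random.random() for _ in range(10)) for _ in range(1000)]
u, proj = build(points, direction=[1.0] * 10)
print(knn(points[0], u, proj, k=5))
```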

5.
王飞  秦小麟  刘亮  沈尧 《计算机科学》2015,42(5):204-210
The k-nearest neighbor join is a common operation in spatial databases; its processing involves two complex operations, the join and the nearest neighbor query. Traditional centralized k-nearest neighbor join algorithms can no longer cope with today's explosively growing data volumes, so designing distributed k-nearest neighbor join algorithms has become a pressing problem. Existing distributed algorithms all consist of several serial MapReduce jobs, each of which must read from and write to the distributed file system; since MapReduce cannot effectively express the dependencies among multiple jobs, these algorithms are inefficient. This paper first proposes a dataflow-based computation framework built on top of MapReduce that models the data processing procedure as a dataflow graph. On top of this framework, an efficient k-nearest neighbor join algorithm is proposed that uses a space-filling curve to map multidimensional data to one dimension, thereby transforming the k-nearest neighbor join into one-dimensional range queries. Experimental results show that the algorithm scales well and is more efficient than existing algorithms.
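The mapping step can be illustrated with a Z-order (Morton) code, one common space-filling curve: quantized coordinates are bit-interleaved into a single integer, so candidate neighbours can be gathered with a one-dimensional range scan. This is only an assumed illustration of the idea; the paper's dataflow framework and its exact curve and range computation are not reproduced.

```python
import bisect

def z_order(point, bits=10):
    """Interleave the bits of quantized integer coordinates (Morton / Z-order
    code), mapping a multidimensional point to one integer; nearby points tend
    to receive nearby codes."""
    dims = len(point)
    code = 0
    for b in range(bits):
        for d, x in enumerate(point):
            code |= ((x >> b) & 1) << (b * dims + d)
    return code

def candidate_range(sorted_codes, q_code, window):
    """A crude stand-in for the one-dimensional range query: keep the points
    whose codes fall inside a window around the query's code; the survivors
    would then be refined by their true distances (refinement not shown)."""
    lo = bisect.bisect_left(sorted_codes, q_code - window)
    hi = bisect.bisect_right(sorted_codes, q_code + window)
    return sorted_codes[lo:hi]

# coordinates quantized to the integer grid [0, 1023]
pts = [(3, 7), (4, 6), (900, 12), (5, 5)]
codes = sorted(z_order(p) for p in pts)
print(codes)
print(candidate_range(codes, z_order((4, 7)), window=64))
```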

6.
A Hierarchical Cube Storage Structure for Data Warehouse Systems   (Cited by 11: 0 self-citations, 11 by others)
高宏  李建中  李金宝 《软件学报》2003,14(7):1258-1266
Range queries are an important operation for supporting on-line analytical processing (OLAP) over data warehouses. In recent years, several Cube storage structures supporting range queries and data updates have been proposed; however, their space and time complexity is so high that they are hard to use in practice. This paper therefore proposes a hierarchical Cube storage structure, HDC (hierarchical data cube), together with its algorithms. On HDC, both the range query cost and the data update cost are O(log^d n), and the combined performance is O((log n)^{2d}) (under the C_qC_u model) or O(K(log n)^d) (under the C_qn_q+C_un_u model). Theoretical analysis and experiments show that HDC outperforms all existing Cube storage structures in range query cost, update cost, space cost, and combined performance.

7.
International Journal of Computer Mathematics, 2012, 89(9): 2021-2038
In this paper, we consider the local discontinuous Galerkin (LDG) finite element method for the one-dimensional time-fractional Fisher's equation, which is obtained from the standard one-dimensional Fisher's equation by replacing the first-order time derivative with a fractional derivative (of order α, with 0<α<1). The proposed method is based on the LDG finite element method in space and a finite difference method in time. We prove that the method is stable, and that the numerical solution converges to the exact one with order O(h^{k+1} + τ^{2-α}), where h, τ, and k are the space step size, the time step size, and the polynomial degree, respectively. The numerical experiments reveal that the LDG method is very effective.

8.
A. Guimier 《Calcolo》1986,23(1):21-43
Conceptual algorithms for random search in optimization. I propose two conceptual algorithms that extend results on the almost-sure convergence of stochastic optimization algorithms of the following form. Let f be a map from a vector space E to the set of real numbers R that is to be minimized; x_0 is an arbitrary point of E and (ξ_k) a family of random vectors. If f(x_k + ξ_k) ≥ f(x_k), then x_{k+1} = x_k; otherwise x_{k+1} = x_k + ξ_k. The two conceptual algorithms are inspired by Polak's conceptual algorithm [11] for deterministic search in optimization.
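The iteration described above translates directly into code; the Gaussian choice of (ξ_k) and the step size below are arbitrary illustrative choices, and the two conceptual algorithms themselves are not reproduced.

```python
import random

def random_search(f, x0, steps=10_000, sigma=0.1):
    """Basic stochastic scheme from the abstract: propose x_k + xi_k with a
    random vector xi_k and keep it only if it strictly improves f."""
    x = list(x0)
    for _ in range(steps):
        xi = [random.gauss(0.0, sigma) for _ in x]        # one choice of (xi_k)
        candidate = [x_i + xi_i for x_i, xi_i in zip(x, xi)]
        if f(candidate) < f(x):       # if f(x_k + xi_k) >= f(x_k), keep x_k
            x = candidate
    return x

# example: minimize a simple quadratic
f = lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2
print(random_search(f, x0=[10.0, 10.0]))
```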

9.
Let S be a set of n taxa. Given a parameter k and a set of quartet topologies Q over S such that there is exactly one topology for every subset of four taxa, the parameterized Minimum Quartet Inconsistency (MQI) problem is to decide whether we can find an evolutionary tree inducing a set of quartet topologies that differs from the given set in at most k quartet topologies. The best fixed-parameter algorithm devised so far for the parameterized MQI problem runs in time O(4^k n + n^4). In this paper, we first present an O(3.0446^k n + n^4) fixed-parameter algorithm and an O(2.0162^k n^3 + n^5) fixed-parameter algorithm for the parameterized MQI problem. Finally, we give an O*((1+ε)^k) fixed-parameter algorithm, where ε>0 is an arbitrarily small constant.

10.
Aiming at the problem of top-k spatial join query processing in cloud computing systems, a Spark-based top-k spatial join (STKSJ) query processing algorithm is proposed. In this algorithm, the whole data space is divided into grid cells of the same size by a grid partitioning method, and each spatial object in one data set is projected into a grid cell. The Minimum Bounding Rectangle (MBR) of all spatial objects in each grid cell is computed. The spatial objects overlapping with these MBRs in another spatial data set are replicated to the corresponding grid cells, thereby filtering out spatial objects for which there are no join results, thus reducing the cost of subsequent spatial join processing. An improved plane sweeping algorithm is also proposed that speeds up the scanning mode and applies threshold filtering, thus greatly reducing the communication and computation costs of intermediate join results in subsequent top-k aggregation operations. Experimental results on synthetic and real data sets show that the proposed algorithm has clear advantages, and better performance than existing top-k spatial join query processing algorithms.
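The filtering phase can be sketched as follows on a single machine: objects of one set are hashed into equal-sized grid cells, each cell's MBR is computed, and only the objects of the other set that overlap an (ε-expanded) MBR are replicated to that cell. The ε distance predicate is an assumption for illustration; the Spark execution, the improved plane sweep, and the top-k aggregation are not shown.

```python
from collections import defaultdict

def cell_of(p, cell_size):
    return (int(p[0] // cell_size), int(p[1] // cell_size))

def mbr(points):
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))

def overlaps(box, p, eps):
    """True if point p lies within eps of the box (candidate for a distance join)."""
    x0, y0, x1, y1 = box
    return (x0 - eps <= p[0] <= x1 + eps) and (y0 - eps <= p[1] <= y1 + eps)

def grid_filter(R, S, cell_size, eps):
    """Hash R into grid cells, compute each cell's MBR, and keep only the S
    objects overlapping some eps-expanded MBR (replicated to that cell)."""
    cells = defaultdict(list)
    for r in R:
        cells[cell_of(r, cell_size)].append(r)
    boxes = {c: mbr(pts) for c, pts in cells.items()}
    candidates = defaultdict(list)
    for s in S:
        for c, box in boxes.items():
            if overlaps(box, s, eps):
                candidates[c].append(s)          # replicate s to the matching cell
    return cells, candidates

R = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0)]
S = [(1.2, 1.1), (7.9, 8.2), (40.0, 40.0)]
print(grid_filter(R, S, cell_size=5.0, eps=0.5))
```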

11.
王洪亚  杨利宏  刘晓强 《软件学报》2016,27(12):3051-3066
Similarity join algorithms are widely used in data cleaning, data integration, near-duplicate web page detection, and other areas. Existing similarity join algorithms are of two types: similarity joins based on a similarity threshold and top-k similarity joins. Top-k join algorithms are well suited to application scenarios in which the similarity threshold is unknown, and the most effective top-k similarity join algorithm to date is Topk-join, proposed by Xiao et al. To address the performance problems of Topk-join, this paper proposes a top-k similarity join algorithm called Opt-join. Opt-join integrates a token batch-processing technique into the existing event-driven framework to reduce the cost of handling prefix events, and it reduces the hash lookup cost by swapping the execution positions of hash lookups and filtering operations, the correctness of which is proved theoretically. Experimental results show that Opt-join achieves a 1.28x to 3.09x performance improvement over Topk-join, and that this advantage keeps growing as the record length or the value of k increases.

12.
Ranking queries, also known as top-k queries, produce results that are ordered on some computed score. Typically, these queries involve joins, where users are usually interested only in the top-k join results. Top-k queries are dominant in many emerging applications, e.g., multimedia retrieval by content, Web databases, data mining, middlewares, and most information retrieval applications. Current relational query processors do not handle ranking queries efficiently, especially when joins are involved. In this paper, we address supporting top-k join queries in relational query processors. We introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. The idea is to rank the join results progressively during the join operation. We introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. The operators are nonblocking and can be integrated into pipelined execution plans. We also propose an efficient heuristic designed to optimize a top-k join query by choosing the best join order. We address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. We implement the new operators inside a prototype database engine based on PREDATOR. The experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance. Received: 23 December 2003, Accepted: 31 March 2004, Published online: 12 August 2004. Edited by S. Abiteboul. Extended version of the paper published in the Proceedings of the 29th International Conference on Very Large Databases, VLDB 2003, Berlin, Germany, pp. 754-765.
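The progressive idea can be sketched as follows: the two inputs are consumed in descending score order, every new tuple is joined against the tuples already seen on the other side, and the scan stops once the current k-th best combined score reaches a threshold that bounds every pair still unseen. This is a single-threaded illustration under the assumption that the combined score is the sum of the two input scores and both inputs are non-empty; it is not the paper's physical operators.

```python
import heapq
import itertools

def rank_join(L, R, key, k):
    """Progressive rank-join sketch: L and R are (score, row) lists sorted by
    score descending, rows join on key(row), and the combined score is the sum
    of the two input scores."""
    inputs = (L, R)
    seen = ({}, {})                      # rows seen so far, grouped by join key
    pos = [0, 0]                         # read positions in L and R
    top = []                             # min-heap holding the current top-k
    tie = itertools.count()              # tie-breaker so rows are never compared

    def next_score(side):
        return inputs[side][pos[side]][0] if pos[side] < len(inputs[side]) else float("-inf")

    while pos[0] < len(L) or pos[1] < len(R):
        side = 0 if (pos[0] <= pos[1] and pos[0] < len(L)) or pos[1] >= len(R) else 1
        score, row = inputs[side][pos[side]]
        pos[side] += 1
        seen[side].setdefault(key(row), []).append((score, row))
        for other_score, other_row in seen[1 - side].get(key(row), []):
            entry = (score + other_score, next(tie), row, other_row)
            if len(top) < k:
                heapq.heappush(top, entry)
            elif entry[0] > top[0][0]:
                heapq.heapreplace(top, entry)
        # no pair involving an unseen tuple can score above this threshold
        threshold = max(L[0][0] + next_score(1), R[0][0] + next_score(0))
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(c, l, r) for c, _, l, r in sorted(top, reverse=True)]

L = [(0.9, ("a", 1)), (0.8, ("b", 2)), (0.1, ("a", 3))]
R = [(0.95, ("b", 7)), (0.5, ("a", 8)), (0.4, ("c", 9))]
print(rank_join(L, R, key=lambda row: row[0], k=2))
```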

13.
This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups, based on their natural characteristics. Two types of weights are introduced to the clustering process to simultaneously identify the importance of feature groups and individual features in each cluster. A new optimization model is given to define the optimization process and a new clustering algorithm FG-k-means is proposed to optimize the optimization model. The new algorithm is an extension to k-means by adding two additional steps to automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data have shown that the FG-k-means algorithm significantly outperformed four k-means type algorithms, i.e., k-means, W-k-means, LAC and EWKM in almost all experiments. The new algorithm is robust to noise and missing values which commonly exist in high-dimensional data.
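The two-level weighting can be illustrated by the dissimilarity used in the assignment step: each feature group and each individual feature contributes with its own weight. The weights below are fixed by hand for illustration; FG-k-means's automatic weight-update steps are not reproduced.

```python
def weighted_distance(x, centre, groups, group_w, feature_w):
    """Dissimilarity with two levels of weights: feature group g has weight
    group_w[g] and feature j has weight feature_w[j]."""
    d = 0.0
    for g, features in enumerate(groups):
        d += group_w[g] * sum(feature_w[j] * (x[j] - centre[j]) ** 2
                              for j in features)
    return d

def assign(points, centres, groups, group_w, feature_w):
    """One assignment step of a k-means-style loop using the weighted distance."""
    return [min(range(len(centres)),
                key=lambda c: weighted_distance(p, centres[c], groups,
                                                group_w, feature_w))
            for p in points]

groups = [[0, 1], [2, 3, 4]]                  # two feature groups
group_w = [0.7, 0.3]
feature_w = [0.5, 0.5, 0.4, 0.3, 0.3]
points = [(1, 2, 0, 0, 1), (8, 9, 1, 1, 0), (1, 1, 0, 1, 1)]
centres = [(1, 1.5, 0, 0.5, 1), (8, 9, 1, 1, 0)]
print(assign(points, centres, groups, group_w, feature_w))
```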

14.
We present a model for edge updates with restricted randomness in dynamic graph algorithms and a general technique for analyzing the expected running time of an update operation. This model is able to capture the average case in many applications, since (1) it allows restrictions on the set of edges which can be used for insertions and (2) the type (insertion or deletion) of each update operation is arbitrary, i.e., not random. We use our technique to analyze existing and new dynamic algorithms for the following problems: maximum cardinality matching, minimum spanning forest, connectivity, 2-edge connectivity, k-edge connectivity, k-vertex connectivity, and bipartiteness. Given a random graph G with m_0 edges and n vertices and a sequence of l update operations such that the graph contains m_i edges after operation i, the expected time for performing the updates for any l is in the case of minimum spanning forests, connectivity, 2-edge connectivity, and bipartiteness. The expected time per update operation is O(n) in the case of maximum matching. We also give improved bounds for k-edge and k-vertex connectivity. Additionally we give an insertions-only algorithm for maximum cardinality matching with worst-case O(n) amortized time per insertion. Received June 11, 1995; revised March 8, 1996.

15.
雷斌  许嘉  谷峪  于戈 《软件学报》2013,24(S2):188-199
Emerging data applications represented by wireless sensor networks and traditional data applications based on image processing both generate large-scale probabilistic data. In probabilistic data management, the top-k similarity join, which returns the k most similar pairs of probabilistic records, is of great practical value. The histogram is one of the most common models for probabilistic data, and the Earth Mover's Distance (EMD), owing to its robustness, quantifies the similarity between histogram records more accurately. However, computing the EMD takes cubic time, which makes EMD-based top-k similarity joins very challenging. Based on the popular MapReduce parallel processing framework and exploiting properties of the dual linear program of the EMD, this paper proposes two EMD-based top-k similarity join algorithms for large-scale probabilistic data. It first presents a basic method following the block nested loop join idea, named Top-k BNLJ. It then improves the data partitioning strategy and proposes Top-k DLPJ, which partitions data according to data locality and effectively reduces the amount of data transferred during the execution of the MapReduce jobs. Evaluation on large real-world datasets confirms the efficiency of Top-k DLPJ and its good scalability on large datasets.
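For reference, the EMD itself is the optimum of a transportation linear program, which is where the cubic cost comes from; a small sketch using scipy.optimize.linprog is given below. Only the distance computation is shown, under the assumption of equal total mass; the MapReduce join algorithms Top-k BNLJ and Top-k DLPJ are not reproduced.

```python
import numpy as np
from scipy.optimize import linprog

def emd(p, q, D):
    """Earth Mover's Distance between two histograms with equal total mass,
    solved as the transportation LP: minimize sum f_ij * D_ij subject to the
    row sums of f equalling p and the column sums equalling q."""
    n, m = len(p), len(q)
    c = D.reshape(-1)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                 # row-sum constraints: sum_j f_ij = p_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                 # column-sum constraints: sum_i f_ij = q_j
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.2, 0.3, 0.5])
bins = np.arange(3.0)
D = np.abs(bins[:, None] - bins[None, :])    # ground distance between bins
print(emd(p, q, D))                          # expected 0.5 for this example
```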

16.
The k-representative skyline query is a class of queries derived from the traditional skyline query. Given a multidimensional dataset D, a skyline query finds all objects in D that are not dominated by any other object and returns them to the user, so that the user can pick high-quality objects according to their own preferences. However, the skyline is usually large, and having to choose from so many objects means that neither the speed nor the quality of the user's choice can be guaranteed. Compared with the traditional skyline query, the k-representative skyline query returns the k most "representative" skyline objects, which effectively solves this problem. Given a sliding window W and a continuous query q, q monitors the data in the window; whenever the window slides, q returns the k objects in the window with the largest combined dominance area. The core idea of existing algorithms is to monitor the skyline of the current window in real time and to update the k representatives whenever the skyline changes. However, maintaining the skyline of the window in real time is usually expensive, and when the skyline is large, selecting the k representatives from it is just as costly, so existing algorithms cannot be used over high-speed streams. To address these problems, this paper proposes the ρ-approximate k-representative skyline query and, to support it, a processing framework called PAKRS (predict-based approximate k representatives skyline). PAKRS first exploits the characteristics of high-speed streams to partition the current window, and then, based on the partitioning…
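The underlying notions can be sketched as follows: a dominance test, the skyline of a window, and a greedy choice of k representatives whose combined dominance area, estimated here by Monte-Carlo sampling over an assumed bounding box, is as large as possible. This is only an illustration of the query semantics; the PAKRS prediction and partitioning machinery is not reproduced.

```python
import random

def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better
    in at least one (here 'smaller is better')."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(window):
    return [p for p in window
            if not any(dominates(q, p) for q in window if q is not p)]

def greedy_k_representatives(sky, k, bounds, samples=20_000):
    """Greedy stand-in for picking the k skyline objects with the largest
    combined dominance area, estimated by sampling the bounding box `bounds`."""
    pts = [tuple(random.uniform(lo, hi) for lo, hi in bounds) for _ in range(samples)]
    chosen, covered = [], [False] * samples
    for _ in range(min(k, len(sky))):
        def gain(p):
            return sum(1 for i, s in enumerate(pts)
                       if not covered[i] and all(x <= y for x, y in zip(p, s)))
        best = max((p for p in sky if p not in chosen), key=gain)
        chosen.append(best)
        for i, s in enumerate(pts):
            if all(x <= y for x, y in zip(best, s)):
                covered[i] = True
    return chosen

window = [(1, 9), (2, 4), (4, 3), (6, 2), (9, 1), (5, 5)]
sky = skyline(window)
print(sky)
print(greedy_k_representatives(sky, k=2, bounds=[(0, 10), (0, 10)]))
```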

17.
Dario Bini 《Calcolo》1985,22(1):209-228
The tensor rank t_𝒜 of the linear space 𝒜 generated by a set of linearly independent matrices A_1, A_2, …, A_p is the least integer t for which there exist t dyads u^(τ) v^(τ)T, τ = 1, 2, ..., t, such that every matrix of 𝒜 is a linear combination of these dyads. If t_𝒜 = n + k with k ≪ n, then some computational problems concerning matrices A ∈ 𝒜 can be solved fast. For example, the parallel inversion of almost any nonsingular matrix A ∈ 𝒜 costs 3 log n + O(log^2 k) steps with max(n^2 + p(n+k), k^2 n + nk) processors; the determinant of A can be evaluated by a parallel algorithm in log p + log n + O(log^2 k) parallel steps and by a sequential algorithm in n(1+k^2) + p(n+k) + O(k^3) multiplications. Analogous results hold for one step of the bisection method, of Newton's iteration method, and of the shifted inverse power method applied to A − λB in order to compute the (generalized) eigenvalues, provided that A, B ∈ 𝒜. The same results hold if tensor rank is replaced by border rank. Applications to the case of banded Toeplitz matrices are shown. Dedicated to Professor S. Faedo on his 70th birthday. Part of the results of this paper has been presented at the Oberwolfach Conference on Komplexitätstheorie, November 1983.

18.
An Improved Algorithm for Weighted 3-Set Packing   (Cited by 1: 0 self-citations, 1 by others)
Packing problems form an important class of NP-hard problems. For the weighted 3-Set Packing problem, the problem is solved by transforming it into the weighted 3-Set Packing Augmentation problem, i.e., the paper mainly discusses how to obtain a maximum-weight (k+1)-packing from a known maximum-weight k-packing. By analyzing the structure of the problem and combining it with the color-coding technique, a parameterized algorithm with running time O*(10.63^k) is first given, which greatly improves the best previous result in the literature, O*(12.83^k). Through further analysis of the structure of (k+1)-packings, a set-partition technique is used to reduce the above bound to O*(7.563^k).

19.
Automatic classification of text documents, one of the essential techniques for Web mining, has always been a hot topic due to the explosive growth of digital documents available on-line. In the text classification community, k-nearest neighbor (kNN) is a simple and yet effective classifier. However, being a lazy learning method without premodelling, kNN has a high cost to classify new documents when the training set is large. The Rocchio algorithm is another well-known and widely used technique for text classification. One drawback of the Rocchio classifier is that it restricts the hypothesis space to the set of linearly separable hyperplane regions. When the data does not fit its underlying assumption well, the Rocchio classifier suffers. In this paper, a hybrid algorithm based on the variable precision rough set is proposed to combine the strengths of both kNN and Rocchio techniques and overcome their weaknesses. An experimental evaluation of different methods is carried out on two common text corpora, i.e., the Reuters-21578 collection and the 20-newsgroup collection. The experimental results indicate that the novel algorithm achieves significant performance improvement.
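The two base classifiers combined by the hybrid can be sketched as follows (Rocchio in its simple one-centroid-per-class form, and kNN by cosine similarity over sparse term vectors); the variable precision rough set mechanism that decides between them is not reproduced, and the toy vectors are illustrative.

```python
import math
from collections import defaultdict

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def rocchio_train(docs):
    """Rocchio, simple form: one centroid per class (mean of its term vectors)."""
    sums, counts = defaultdict(lambda: defaultdict(float)), defaultdict(int)
    for vec, label in docs:
        counts[label] += 1
        for t, v in vec.items():
            sums[label][t] += v
    return {label: {t: v / counts[label] for t, v in s.items()}
            for label, s in sums.items()}

def rocchio_predict(vec, centroids):
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))

def knn_predict(vec, docs, k=3):
    """Plain kNN by cosine similarity with majority vote among the k nearest."""
    near = sorted(docs, key=lambda d: cosine(vec, d[0]), reverse=True)[:k]
    votes = defaultdict(int)
    for _, label in near:
        votes[label] += 1
    return max(votes, key=votes.get)

docs = [({"ball": 2.0, "goal": 1.0}, "sport"),
        ({"match": 1.0, "goal": 2.0}, "sport"),
        ({"stock": 2.0, "bank": 1.0}, "finance"),
        ({"bank": 2.0, "rate": 1.0}, "finance")]
query = {"goal": 1.0, "ball": 1.0}
centroids = rocchio_train(docs)
print(rocchio_predict(query, centroids), knn_predict(query, docs))
```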

20.
A linear discrete system x(k+1) = Ax(k) with a given attainability set (c, x(N)) ≥ c_0 is considered. The influence of small variations of the problem parameters on the attainability of the required set is investigated. For the control system x(k+1) = Ax(k) + Bu(k), the choice of a controller as a state feedback u(k) = Kx(k) is studied.
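A small simulation sketch of the closed loop described above, with illustrative matrices not taken from the paper: the state feedback u(k) = Kx(k) is applied to x(k+1) = Ax(k) + Bu(k) and the attainability condition (c, x(N)) ≥ c_0 is checked at the horizon N.

```python
import numpy as np

def simulate(A, B, K, x0, N):
    """Closed loop x(k+1) = (A + B K) x(k) under the state feedback u(k) = K x(k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(N):
        x = A @ x + B @ (K @ x)
    return x

def attains(c, xN, c0):
    """Attainability condition from the abstract: (c, x(N)) >= c0."""
    return float(np.dot(c, xN)) >= c0

# illustrative numbers, not taken from the paper
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[-2.0, -1.0]])
xN = simulate(A, B, K, x0=[1.0, 0.0], N=20)
print(xN, attains(c=[1.0, 0.0], xN=xN, c0=-0.5))
```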
