Similar Documents
20 similar documents found (search time: 93 ms).
1.
Although k-anonymity is a good way of publishing microdata for research purposes, it cannot resist several common attacks, such as attribute disclosure and the similarity attack. To resist these attacks, many refinements of k-anonymity have been proposed, with t-closeness being one of the strictest privacy models. While most existing t-closeness models address the case in which the original data have only a single sensitive attribute, data with multiple sensitive attributes are more common in practice. In this paper, we cover this gap with two proposed algorithms for multiple sensitive attributes that make the published data satisfy t-closeness. Based on the observation that the values of the sensitive attributes in any equivalence class must be spread as widely as possible over the entire data for the published data to satisfy t-closeness, both algorithms use different methods to partition records into groups in terms of sensitive attributes: one uses a clustering method, while the other leverages principal component analysis. Then, according to the similarity of quasi-identifier attributes, records are selected from different groups to construct an equivalence class, which reduces the loss of information as much as possible during anonymization. Our proposed algorithms are evaluated using a real dataset. The results show that the first algorithm is slower on average than the second but preserves more of the original information. In addition, compared with related approaches, both proposed algorithms achieve stronger privacy protection and incur less information loss.
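As an illustration of the privacy condition such algorithms enforce, the sketch below checks t-closeness for a single categorical sensitive attribute, using the equal-distance ground metric under which the Earth Mover's Distance reduces to half the L1 distance between the class and table distributions. The function and attribute names are illustrative assumptions; this is the single-attribute building block, not the paper's multi-attribute algorithms.

```python
from collections import Counter

def t_closeness_ok(eq_class, table, attr, t):
    """Does one equivalence class satisfy t-closeness for a categorical
    sensitive attribute? With an equal-distance ground metric, EMD is
    half the L1 distance between the two value distributions."""
    def distribution(records):
        counts = Counter(r[attr] for r in records)
        return {v: c / len(records) for v, c in counts.items()}
    p, q = distribution(eq_class), distribution(table)
    emd = 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))
    return emd <= t

table = [{"disease": d} for d in ["flu"] * 6 + ["hiv"] * 2]
print(t_closeness_ok(table[:4], table, "disease", 0.3))  # True: EMD = 0.25
```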

2.
Selectivity estimation is an important step of query optimization in a database management system, and multi-dimensional histogram techniques have proved promising for selectivity estimation. Recent multi-dimensional histogram techniques such as GenHist and STHoles use an arbitrary bucket layout. This layout has the advantage of requiring a smaller number of buckets to model tuple densities than those required by the traditional grid or recursive layouts. However, the arbitrary bucket layout brings an inherent disadvantage: more memory is needed to store each bucket's location information. This diminishes the advantage of requiring fewer buckets and, therefore, has an adverse effect on the resulting selectivity estimation accuracy. To our knowledge, however, no existing histogram-based technique with an arbitrary layout addresses this issue. In this paper, we introduce the idea of bucket location compression and then demonstrate its effectiveness for improving selectivity estimation accuracy by proposing the STHoles+ technique. STHoles+ extends STHoles by quantizing each coordinate of a bucket relative to the coordinates of the smallest enclosing bucket. This quantization increases the number of histogram buckets that can be stored within a given memory budget. Our quantization scheme allows STHoles+ to trade precision of histogram bucket locations for storing more buckets. Experimental results show that STHoles+ outperforms STHoles on various data distributions, query distributions, and other factors such as available memory size, quantization resolution, and dimensionality of the data space.
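The core compression idea lends itself to a short sketch: each child-bucket coordinate is stored as a small fixed-width offset relative to its smallest enclosing bucket, so a quantized bucket costs a few bits per coordinate instead of a full float. The 8-bit resolution below is an assumed parameter for illustration, not a value prescribed by the paper.

```python
def quantize(child_lo, child_hi, parent_lo, parent_hi, bits=8):
    """Store a child bucket's interval as integer offsets relative to
    its smallest enclosing (parent) bucket on one dimension."""
    levels = (1 << bits) - 1
    span = parent_hi - parent_lo
    return (round((child_lo - parent_lo) / span * levels),
            round((child_hi - parent_lo) / span * levels))

def dequantize(q_lo, q_hi, parent_lo, parent_hi, bits=8):
    """Recover an approximate interval; the rounding error is the price
    paid for fitting more buckets into the same histogram memory."""
    levels = (1 << bits) - 1
    span = parent_hi - parent_lo
    return (parent_lo + q_lo / levels * span,
            parent_lo + q_hi / levels * span)

lo, hi = quantize(12.5, 17.5, 10.0, 50.0)
print(lo, hi, dequantize(lo, hi, 10.0, 50.0))  # 16 48 (~12.51, ~17.53)
```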

3.
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by simply retrieving those tuples that appear in the same buckets as the query tuple itself, without completely scanning the original database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
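A minimal sketch of such a hash-based index follows, assuming character 3-grams as the indexing keys (one simple instance of the kind of key scheme the paper compares); the class and method names are illustrative.

```python
from collections import defaultdict

class QGramIndex:
    """Toy hash-based index: each tuple maps to a set of indexing keys
    (here, character 3-grams), so candidate duplicates share buckets."""
    def __init__(self, q=3):
        self.q = q
        self.buckets = defaultdict(set)
        self.tuples = []

    def _keys(self, text):
        s = f"#{text.lower()}#"
        return {s[i:i + self.q] for i in range(len(s) - self.q + 1)}

    def add(self, text):
        tid = len(self.tuples)
        self.tuples.append(text)
        for k in self._keys(text):
            self.buckets[k].add(tid)

    def candidates(self, text):
        """Neighbors of a query tuple: everything sharing a bucket with
        it, found without scanning the whole database."""
        ids = set()
        for k in self._keys(text):
            ids |= self.buckets[k]
        return [self.tuples[i] for i in ids]

idx = QGramIndex()
for name in ["John Smith", "Jon Smith", "Jane Doe"]:
    idx.add(name)
print(idx.candidates("John Smyth"))  # both Smith variants, not Jane Doe
```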

4.
Basel II imposes regulatory capital on banks related to the default risk of their credit portfolio. Banks using an internal rating approach compute the regulatory capital from pooled probabilities of default (PDs). These pooled probabilities can be calculated by clustering credit borrowers into different buckets and computing the mean PD for each bucket. The clustering problem can become very complex when Basel II regulations and real-world constraints are taken into account. Search heuristics have already shown remarkable performance in tackling this problem. A Threshold Accepting algorithm is proposed that exploits the inherent discrete nature of the clustering problem. This algorithm is found to outperform alternative methodologies already proposed in the literature, such as standard k-means and Differential Evolution. Besides considering several clustering objectives for a given number of buckets, we extend the analysis further by introducing new methods to determine the optimal number of buckets in which to cluster banks' clients.
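The following sketch shows a Threshold Accepting skeleton on this problem, assuming sorted PDs grouped into contiguous buckets (at least two) and a within-bucket squared-deviation objective as a stand-in for the paper's Basel II-specific criteria. Like simulated annealing it accepts small deteriorations, but against a deterministic threshold schedule rather than a probabilistic one.

```python
import random

def threshold_accepting(pds, n_buckets, steps=2000,
                        thresholds=(0.02, 0.005, 0.0)):
    """Group sorted PDs into contiguous buckets by moving boundaries;
    `thresholds` is a decreasing acceptance schedule."""
    pds = sorted(pds)
    bounds = sorted(random.sample(range(1, len(pds)), n_buckets - 1))

    def cost(b):
        total, prev = 0.0, 0
        for cut in b + [len(pds)]:
            grp = pds[prev:cut]
            m = sum(grp) / len(grp)
            total += sum((x - m) ** 2 for x in grp)
            prev = cut
        return total

    current = cost(bounds)
    for thr in thresholds:
        for _ in range(steps):
            cand = bounds[:]
            i = random.randrange(len(cand))
            lo = cand[i - 1] + 1 if i > 0 else 1
            hi = cand[i + 1] - 1 if i + 1 < len(cand) else len(pds) - 1
            if lo > hi:
                continue
            cand[i] = random.randint(lo, hi)
            c = cost(cand)
            if c - current <= thr * current:  # accept small deteriorations
                bounds, current = cand, c
    return bounds, current

random.seed(1)
pds = [random.betavariate(1, 40) for _ in range(300)]
print(threshold_accepting(pds, n_buckets=5)[0])
```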

5.
A balanced multiple-valued filing scheme is designed using the standard array of a linear (n, k) code. It is valid for all queries of order ≤ n. The distribution of records over buckets and the redundancy vary with the code selected. Unlike other such schemes, the storage and retrieval algorithms are defined for any set of parameters; moreover, the algorithms are simple.

6.
To address the large information loss on quasi-identifier attributes in existing privacy-protection methods for multi-dimensional numerical sensitive data, and their inability to meet users' personalized needs for ranking the importance of numerical sensitive attributes, a personalized privacy-preservation method based on clustering and weighted multi-dimensional bucket (MSB) grouping is proposed. First, according to the similarity of quasi-identifiers, the dataset is partitioned into subsets with similar quasi-identifier attribute values. Then, since users differ in how sensitive they consider each attribute, the sensitivity levels and the bucket capacities of the multi-dimensional buckets are used to compute weighted selection degrees and to build weighted multi-dimensional buckets. Finally, the data are grouped and anonymized accordingly. Experiments were run on eight attributes of the standard UCI Adult dataset, comparing against MNSACM, a privacy-protection method based on clustering and multi-dimensional buckets, and WMNSAPM, a personalized privacy-preservation method based on clustering and weighted multi-dimensional bucket grouping. The experimental results show that the proposed method performs better overall and clearly outperforms the comparison methods in reducing information loss and running time, improving data quality and efficiency.

7.
The problem of record linkage is to identify records from two datasets that refer to the same entities (e.g. patients). A particular issue in record linkage is the presence of missing values in records, which has not been fully addressed. Another issue is how privacy and confidentiality can be preserved in the process of record linkage. In this paper, we propose an approach for privacy-preserving record linkage in the presence of missing values. For any missing value in a record, our approach imputes the similarity measure between the missing value and the value of the corresponding field in any of the possible matching records from another dataset. We use the k-NNs (k Nearest Neighbours in the same dataset) of the record with the missing value, and their distances to that record, for similarity imputation. For privacy preservation, our approach uses the Bloom filter protocol in the settings of both standard privacy-preserving record linkage without missing values and privacy-preserving record linkage with missing values. We have conducted an experimental evaluation using three pairs of synthetic datasets with different rates of missing values. Our experimental results show the effectiveness and efficiency of the proposed approach.
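The Bloom filter protocol mentioned here is a standard building block of privacy-preserving record linkage: each quasi-identifier string is encoded as a Bloom filter of its q-grams, and encodings are compared with the Dice coefficient so raw values are never exchanged. The parameter choices below (q = 2, 256 bits, 4 hash functions) are illustrative assumptions.

```python
import hashlib

def bloom_encode(value, q=2, size=256, hashes=4):
    """Encode a string as a Bloom filter of its q-grams."""
    bits = [0] * size
    s = f"_{value.lower()}_"
    for i in range(len(s) - q + 1):
        gram = s[i:i + q]
        for h in range(hashes):
            d = hashlib.sha256(f"{h}:{gram}".encode()).digest()
            bits[int.from_bytes(d[:4], "big") % size] = 1
    return bits

def dice(a, b):
    """Dice coefficient between two filters: the usual similarity for
    comparing encodings without revealing the underlying values."""
    inter = sum(x & y for x, y in zip(a, b))
    denom = sum(a) + sum(b)
    return 2 * inter / denom if denom else 0.0

print(round(dice(bloom_encode("smith"), bloom_encode("smyth")), 2))
```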

8.
Combinatorial testing is an effective testing technique for revealing failures in a given system, based on covering input combinations and on combinatorial optimization. Combinatorial testing of strength t (t ≥ 2) requires that each t-wise tuple of values of the different system input parameters be covered by at least one test case. Combinatorial test suite generation algorithms aim at producing a test suite that covers all the required tuples in a small (possibly minimal) number of test cases, in order to reduce the cost of testing. The most widely used combinatorial technique is pairwise testing (t = 2), which requires coverage of all pairs of input values. Constrained combinatorial testing also takes into account constraints over the system parameters, for instance forbidden tuples of inputs, modeling invalid or unrealizable input combinations. In this paper a new approach to combinatorial testing, tightly integrated with formal logic, is presented. In this approach, test predicates are used to formalize combinatorial testing as a logical problem, and an external formal logic tool is applied to solve it. Constraints over the input domain are expressed as logical predicates too, and effectively handled by the same tool. Moreover, inclusion or exclusion of selected tuples is supported, allowing the user to customize the test suite layout. The proposed approach is supported by a prototype tool implementation, and results of an experimental assessment are also presented.
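For concreteness, the sketch below enumerates the pairwise (t = 2) tuples a covering suite must hit and checks a candidate suite against them; in the constrained setting, forbidden pairs would simply be removed from the required set before the check. The toy model and names are illustrative, not the paper's logic-based encoding.

```python
from itertools import combinations, product

def required_pairs(domains):
    """Every pairwise (t = 2) parameter-value combination to cover."""
    pairs = set()
    for (i, di), (j, dj) in combinations(enumerate(domains), 2):
        for v, w in product(di, dj):
            pairs.add(((i, v), (j, w)))
    return pairs

def covers(suite, domains):
    """Does the test suite hit every required pair?"""
    needed = required_pairs(domains)
    for test in suite:
        for i, j in combinations(range(len(test)), 2):
            needed.discard(((i, test[i]), (j, test[j])))
    return not needed

domains = [["on", "off"], ["a", "b"], [0, 1]]
suite = [("on", "a", 0), ("on", "b", 1), ("off", "a", 1), ("off", "b", 0)]
print(covers(suite, domains))  # True: 4 tests cover all 12 pairs
```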

9.
Consider a random-access file which consists of a given number of buckets. Each bucket contains a fixed number of slots. Storage and retrieval of records are performed using linear probing. The probabilities underlying the behavior of this addressing system are determined, and the relevant performance measures are derived.
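A toy version of the addressing scheme under analysis, assuming hashed home buckets and bucket-level linear probing with no deletions; the class and method names are illustrative.

```python
class LinearProbingFile:
    """Random-access file: nbuckets buckets of nslots slots each; a
    record overflowing its home bucket probes the following buckets."""
    def __init__(self, nbuckets, nslots):
        self.nslots = nslots
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key):
        home = hash(key) % len(self.buckets)
        for step in range(len(self.buckets)):
            b = self.buckets[(home + step) % len(self.buckets)]
            if len(b) < self.nslots:
                b.append(key)
                return step  # number of extra probes: a performance measure
        raise OverflowError("file is full")

    def lookup(self, key):
        home = hash(key) % len(self.buckets)
        for step in range(len(self.buckets)):
            b = self.buckets[(home + step) % len(self.buckets)]
            if key in b:
                return True
            if len(b) < self.nslots:
                return False  # a non-full bucket ends every probe chain
        return False
```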

10.
Uncertain data are inevitable in many applications due to factors such as the limitations of measuring equipment and delays in data updates. Although modeling and querying uncertain data have recently attracted considerable attention from the database community, there are still many critical issues to be resolved with respect to conducting advanced analysis on uncertain data. In this paper, we study the execution of the probabilistic skyline query over uncertain data streams. We propose a novel sliding-window skyline model in which an uncertain tuple has some probability of being in the skyline at a given timestamp t. Formally, a Wp-Skyline(p, t) contains all the tuples whose probabilities of being in the skyline are at least p at timestamp t. However, in the stream environment, computing a probabilistic skyline over a large number of uncertain tuples within the sliding window is a daunting task in practice. In order to efficiently calculate the Wp-Skyline, we propose an efficient and effective approach, namely the candidate list approach, which maintains lists of candidates that might become skyline tuples in future sliding windows. We also propose algorithms that continuously monitor the newly incoming and expired data to maintain the skyline candidate set incrementally. To further reduce the computation cost of deciding whether or not a candidate tuple belongs to the skyline, we propose an enhanced refinement strategy based on a multi-dimensional indexing structure combined with a grouping-and-conquer strategy. To validate the effectiveness of our proposed approach, we conduct extensive experiments on both real and synthetic data sets and make comparisons with basic techniques.
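As a minimal sketch of the quantities involved, assuming tuples exist independently with given probabilities and smaller coordinates are preferred: a tuple's skyline probability is its own existence probability times the probability that no dominating tuple exists, and Wp-Skyline(p, t) keeps the tuples for which that value is at least p. This illustrates the definition only, not the paper's candidate-list maintenance.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in
    at least one (smaller values preferred)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline_probability(coords, prob, window):
    """Own existence probability times the probability that every
    dominating tuple is absent (independence assumed)."""
    p = prob
    for other, q in window:
        if other != coords and dominates(other, coords):
            p *= 1 - q
    return p

window = [((1, 4), 0.9), ((2, 2), 0.6), ((3, 3), 0.8)]
wp = [(c, q) for c, q in window if skyline_probability(c, q, window) >= 0.5]
print(wp)  # ((3, 3), 0.8) drops out: dominated by (2, 2), 0.8 * 0.4 = 0.32
```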

11.
Agarwal, Bhattacharya, Sen. Algorithmica, 2002, 32(4): 521-539
We consider the following one- and two-dimensional bucketing problems: Given a set S of n points in R^1 or R^2 and a positive integer b, distribute the points of S into b equal-size buckets so that the maximum number of points in a bucket is minimized. Suppose at most (n/b) + Δ points lie in each bucket in an optimal solution. We present algorithms whose time complexities depend on b and Δ. No prior knowledge of Δ is necessary for our algorithms. For the one-dimensional problem, we give a deterministic algorithm that achieves a running time of O(b^4(Δ^2 + log n) + n). For the two-dimensional problem, we present a Monte Carlo algorithm that runs in subquadratic time for small values of b and Δ. The previous algorithms, by Asano and Tokuyama [1], searched the entire parameterized space and required Ω(n^2) time in the worst case even for constant values of b and Δ. We also present a subquadratic algorithm for the special case of the two-dimensional problem when b = 2.
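To make the objective concrete, the primitive below evaluates one candidate uniform bucketing of 1-D points and reports the excess Δ of the fullest bucket over a perfectly balanced load; a naive search would repeat this over the whole parameter space of candidate grids, which is the Ω(n^2) behavior the paper's output-sensitive algorithms avoid. The half-open-range assumption and names are ours.

```python
import math

def bucket_excess(points, origin, width, b):
    """Excess Delta = max bucket count - ceil(n / b) for one candidate
    grid of b equal-size buckets. Assumes every point lies in
    [origin, origin + b * width)."""
    counts = [0] * b
    for x in points:
        i = int((x - origin) // width)
        assert 0 <= i < b, "point outside the bucketed range"
        counts[i] += 1
    return max(counts) - math.ceil(len(points) / b)

print(bucket_excess([0.05, 0.1, 0.4, 0.45, 0.8], 0.0, 0.5, 2))  # Delta = 1
```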

12.
左开中, 尚宁, 陶健, 王涛春. 计算机应用 (Journal of Computer Applications), 2017, 37(6): 1599-1604
Sensed data in wireless sensor networks are easily affected by the external environment, so incomplete data are widespread, and the sensed data face serious privacy threats. To address privacy leakage during incomplete-data queries in two-tiered sensor networks, a Privacy-Preserving Incomplete-data Skyline query protocol (PPIS) based on substitution and bucketing techniques is proposed. To enable Skyline queries over incomplete data, PPIS substitutes the upper bound of the data domain for missing attribute values and maps the incomplete data into buckets. To guarantee data privacy, PPIS first converts bucket intervals into prefix codes and then loads the codes into a Bloom filter, so that storage nodes can process queries without seeing the plaintext of the data or the bucket intervals. To guarantee the integrity of query results, PPIS builds integrity-verification codes with a Merkle hash tree, enabling verification that query results are complete. Theoretical analysis and simulation experiments validate the security and effectiveness of PPIS: compared with the existing privacy-preserving Skyline query protocols SMQ and SSQ, PPIS reduces communication energy consumption by more than 70%.
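The prefix encoding of bucket intervals mentioned above is the standard prefix-membership trick: a value lies in a range iff the set of its binary prefixes intersects the range's minimal prefix cover, so membership can be tested (or the codes loaded into a Bloom filter) without revealing plaintext values. The sketch shows only this encoding step, not the full PPIS protocol; the default 8-bit domain is an assumption.

```python
def prefix_family(value, bits=8):
    """All prefixes of a value's binary form, padded with '*'."""
    s = format(value, f"0{bits}b")
    return {s[:i] + "*" * (bits - i) for i in range(bits + 1)}

def range_cover(lo, hi, bits=8):
    """Minimal set of prefixes covering the integer range [lo, hi]."""
    covers = []
    while lo <= hi:
        step = 1
        # grow the block while it stays aligned and inside [lo, hi]
        while lo % (step * 2) == 0 and lo + step * 2 - 1 <= hi:
            step *= 2
        k = step.bit_length() - 1
        covers.append("*" * bits if k == bits
                      else format(lo >> k, f"0{bits - k}b") + "*" * k)
        lo += step
    return covers

def in_range(value, lo, hi, bits=8):
    return bool(prefix_family(value, bits) & set(range_cover(lo, hi, bits)))

print(range_cover(5, 7, bits=3), in_range(6, 5, 7, bits=3))  # ['101', '11*'] True
```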

13.
We address the problem of load shedding for continuous multi-way join queries over multiple data streams. When the arrival rates of tuples from data streams exceed the system capacity, a load shedding algorithm drops some subset of input tuples to avoid system overloads. To decide which tuples to drop among the input tuples, most existing load shedding algorithms determine the priority of each input tuple based on the frequency or some historical statistics of its join attribute value, and then drop tuples with the lowest priority. However, those value-based algorithms cannot determine the priorities of tuples properly in environments where join attribute values are unique and each join attribute value occurs at most once in each data stream. In this paper, we propose a load shedding algorithm specifically designed for such environments. The proposed load shedding algorithm determines the priority of each tuple based on the order of streams in which its join attribute value appears, rather than its join attribute value itself. Consequently, the priorities of tuples can be determined effectively in environments where join attribute values are unique and do not repeat. The experimental results show that the proposed algorithm outperforms the existing algorithms in such environments in terms of effectiveness and efficiency.
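The order-based idea admits a compact sketch: track, per join value, the set of streams in which it has appeared; since values are unique and non-repeating, a value already seen in more streams is closer to completing a multi-way join result, so its tuples deserve higher priority. The exact priority and eviction rules below are illustrative assumptions, not the paper's algorithm.

```python
from collections import defaultdict

class OrderBasedPriority:
    """Priority from stream order rather than value frequency."""
    def __init__(self):
        self.seen = defaultdict(set)  # join value -> streams it appeared in

    def observe(self, stream_id, join_value):
        self.seen[join_value].add(stream_id)

    def priority(self, join_value):
        return len(self.seen[join_value])  # more streams = closer to a result

def shed(buffer, prio, capacity):
    """Keep the `capacity` highest-priority (stream, value) tuples."""
    ranked = sorted(buffer, key=lambda t: prio.priority(t[1]), reverse=True)
    return ranked[:capacity]
```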

14.
Generalization is an important technique for protecting privacy in data dissemination. In the framework of generalization, ℓ-diversity is a strong notion of privacy. However, since existing ℓ-diversity measures are defined in terms of the most specific (rather than general) sensitive attribute (SA) values, algorithms based on these measures can have narrow eligible ranges for data with a heavily skewed distribution of SA values and produce anonymous data of low utility. In this paper, we propose a new ℓ-diversity measure called functional (τ, ℓ)-diversity, which extends ℓ-diversity by using a simple function to constrain the frequencies of base SA values induced by general SA values. As a result, algorithms based on (τ, ℓ)-diversity may generalize SA values and thus are much less constrained by skewed SA distributions. We show that (τ, ℓ)-diversity is more flexible and elaborate than existing ℓ-diversity measures. We present an efficient heuristic algorithm that uses a novel order of quasi-identifier (QI) values to achieve (τ, ℓ)-diversity. We compare our algorithm with two state-of-the-art algorithms based on existing ℓ-diversity measures. Our preliminary experimental results indicate that our algorithm not only provides stronger privacy protection but also yields better utility of the anonymous data.
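For contrast with the refined measure proposed here, the baseline distinct ℓ-diversity check is a one-liner, and the example shows the skew problem that motivates going beyond it; names are illustrative.

```python
def distinct_l_diverse(eq_class, sa, l):
    """Baseline check: at least l distinct sensitive values per class."""
    return len({r[sa] for r in eq_class}) >= l

skewed = [{"disease": "flu"}] * 9 + [{"disease": "hiv"}]
print(distinct_l_diverse(skewed, "disease", 2))  # True, despite heavy skew
```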

15.
Efficiently extendible mappings for balanced data distribution
In data storage applications, a large collection of consecutively numbered data “buckets” is often mapped to a relatively small collection of consecutively numbered storage “bins.” For example, in parallel database applications, buckets correspond to hash buckets of data and bins correspond to database nodes. In disk array applications, buckets correspond to logical tracks and bins correspond to physical disks in an array. Measures of the “goodness” of a mapping method include:
  1. The time (number of operations) needed to compute the mapping.
  2. The storage needed to store a representation of the mapping.
  3. The balance of the mapping, i.e., the extent to which all bins receive the same number of buckets.
  4. The cost of relocation, that is, the number of buckets that must be relocated to a new bin if a new mapping is needed due to an expansion of the number of bins or the number of buckets.
One contribution of this paper is to give a new mapping method, the Interval-Round-Robin (IRR) method. The IRR method has optimal balance and relocation cost, and its time complexity and storage requirements compare favorably with known methods. Specifically, if m is the number of times that the number of bins and/or buckets has increased, then the time complexity is O(log m) and the storage is O(m^2). Another contribution of the paper is to identify the concept of a history-independent mapping, meaning informally that the mapping does not “remember” the past history of expansions to the number of buckets and bins, but only the current number of buckets and bins. Thus, such mappings require very little information to be stored. Assuming that balance and relocation are optimal, we prove that history-independent mappings are possible if the number of buckets is fixed (so only the number of bins can increase), but not possible if the number of bins and buckets can both increase.
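The relocation cost that IRR keeps optimal is easy to see against the naive baseline: the plain round-robin mapping below has optimal balance, yet growing the number of bins relocates most buckets. This baseline is for illustration only; it is not the IRR method itself.

```python
def round_robin_map(bucket, nbins):
    """Baseline balanced mapping: bucket i -> bin i mod nbins."""
    return bucket % nbins

def relocations(nbuckets, old_bins, new_bins):
    """How many buckets move when the number of bins changes."""
    return sum(round_robin_map(b, old_bins) != round_robin_map(b, new_bins)
               for b in range(nbuckets))

print(relocations(1000, 10, 11))  # most of the 1000 buckets relocate
```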

16.
k-Anonymity is a method for providing privacy protection by ensuring that data cannot be traced to an individual. In a k-anonymous dataset, any identifying information occurs in at least k tuples. To achieve optimal and practical k-anonymity, many different kinds of algorithms, with various assumptions and restrictions, have recently been proposed, along with different metrics to measure quality. This paper evaluates a family of clustering-based algorithms that are more flexible and that even attempt to improve precision by ignoring the restrictions of user-defined Domain Generalization Hierarchies. The evaluation of the new approaches with respect to cost metrics shows that metrics may behave differently with different algorithms and may not correlate with some applications' accuracy on output data.
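A representative member of this family is the greedy k-member clustering step sketched below: seed a cluster, absorb the k - 1 nearest remaining records, repeat, then generalize each cluster as one group. The distance function is supplied by the caller, and folding leftovers into the last cluster is one common convention among several; this is a sketch, not any specific algorithm from the paper.

```python
def greedy_k_clusters(records, k, dist):
    """Greedy k-member clustering: every cluster ends up with >= k
    records, the prerequisite for k-anonymous generalization."""
    remaining, clusters = list(records), []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        remaining.sort(key=lambda r: dist(seed, r))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    if remaining and clusters:
        clusters[-1] += remaining  # fold leftovers (< k) into the last cluster
    return clusters

print(greedy_k_clusters([21, 22, 23, 35, 36, 60], 2, lambda a, b: abs(a - b)))
# [[21, 22], [23, 35], [36, 60]]
```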

17.
Privacy issues represent a longstanding problem. Measures such as k-anonymity, l-diversity and t-closeness are among the most widely used ways to protect released data. This work proposes to extend these three measures to the case where the data are protected using fuzzy sets instead of intervals or representative elements. The proposed approach is then tested on the Energy Information Authority data set with different fuzzy partition methods. Results show an improvement in data protection when the data are encoded using fuzzy sets.
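The key change, replacing crisp intervals with fuzzy sets, can be sketched with triangular membership functions: a numeric value is released as its degrees of membership in overlapping fuzzy partitions rather than as a single interval. The partitions below are illustrative, not the paper's partition methods.

```python
def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fuzzify(x, partitions):
    """Encode a numeric value as membership degrees in fuzzy partitions
    instead of a crisp interval."""
    return [round(triangular(x, *p), 2) for p in partitions]

partitions = [(0, 25, 50), (25, 50, 75), (50, 75, 100)]  # low / mid / high
print(fuzzify(40, partitions))  # [0.4, 0.6, 0.0]
```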

18.
A hybrid genetic algorithm is used to find high-order equivalent circuits (ECs) of synchronous machines from standstill frequency response (SSFR) data. The algorithm performs satisfactorily despite the many local minima surrounding the optimal solution of high-order ECs. It gives circuit parameters that simultaneously fit the three independent transfer functions given by the d-axis two-port network of the synchronous machine. It is found that as the order of the EC is increased, the optimization index used in the identification procedure clearly improves. This leads to a new way of determining the right number of rotor branches required to correctly reproduce the SSFR data. The q-axis network is also analyzed with the hybrid algorithm. The so-called Canay inductances are included in this one-port network to test whether the fitting properties of the q-axis EC can be improved. The SSFR data used in this work are generated by a finite element model of a turbine generator, but actual data can also be readily handled.

19.
Histograms are used to summarize the contents of relations into a number of buckets for the estimation of query result sizes. Several techniques have been proposed in the past for determining bucket boundaries that provide accurate estimations. However, while search strategies for optimal bucket boundaries are rather sophisticated, not much attention has been paid to estimation inside buckets, and all of the above techniques adopt naive methods for such estimation. This paper focuses on the problem of improving the estimation inside a bucket once its boundaries have been fixed. The proposed technique is based on adding, to each bucket, one memory word of additional information (organized into a tree-like index) that stores approximate cumulative frequencies in a hierarchical fashion. Both theoretical analysis and experimental results show that the proposed approach improves the accuracy of estimation inside buckets with respect to both classical approaches (such as the continuous value assumption and the uniform spread assumption) and a number of alternative ways of organizing the additional information. The index is then added to state-of-the-art histograms, with the non-obvious result that, despite the spatial overhead (which reduces the number of allowed buckets once the storage space has been fixed), the original methods are strongly improved in terms of accuracy. An abridged version of this paper appeared in the Proceedings of the International Conference on Data Engineering (ICDE 2002), IEEE Computer Society 2002, ISBN 0-7695-1531-2 [3].
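A sketch of a per-bucket index in the spirit described: a small binary tree whose node at each level stores the fraction of its segment's frequency mass falling in the left half, with the uniform (continuous-value) assumption used only inside the final segment. A real implementation would quantize each stored fraction to a few bits so the tree fits in one extra memory word; the three-level depth and the exact layout here are our assumptions.

```python
def build_index(freqs, levels=3):
    """Binary tree over a bucket's frequencies: each node records the
    fraction of its segment's mass in the left half (unquantized here)."""
    tree, segments = [], [freqs]
    for _ in range(levels):
        level, nxt = [], []
        for seg in segments:
            mid = len(seg) // 2
            whole = sum(seg) or 1
            level.append(sum(seg[:mid]) / whole)
            nxt += [seg[:mid], seg[mid:]]
        tree.append(level)
        segments = nxt
    return sum(freqs), tree

def estimate_cumulative(total, tree, frac):
    """Cumulative frequency up to relative position `frac` in the
    bucket; uniform assumption only inside the final segment."""
    acc, share, lo, width = 0.0, 1.0, 0.0, 1.0
    for d, level in enumerate(tree):
        left = level[int(lo * (1 << d))]
        mid = lo + width / 2
        if frac < mid:
            share *= left
        else:
            acc += share * left
            share *= 1 - left
            lo = mid
        width /= 2
    return total * (acc + share * (frac - lo) / width)

total, tree = build_index([0, 0, 0, 0, 10, 10, 10, 10])
print(estimate_cumulative(total, tree, 0.75))  # 20.0: mass below 3/4 of bucket
```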

20.
In systems coordinated through a distributed set of tuple spaces, it is crucial to assist agents in retrieving the tuples they are interested in. This can be achieved by sorting techniques that group similar tuples together in the same tuple space, so that the position of a tuple can be inferred by similarity. Accordingly, we formulate the collective sort problem for distributed tuple spaces, where a set of agents is in charge of moving tuples until a complete sort is reached, namely, each of the N tuple spaces aggregates tuples belonging to one of the N kinds available. After pointing out the requirements for effectively tackling this problem, we propose a self-organizing solution resembling the brood sorting performed by ants. It is based on simple agents that perform partial observations and accordingly take decisions on tuple movement. Convergence is addressed by a fully adaptive method for simulated annealing, based on noise tuples inserted and removed by agents on an as-needed basis so as to avoid sub-optimal sorting. The emergence of sorting properties and scalability are evaluated through stochastic simulations.
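The agent behavior can be sketched in a few lines: an agent observes two tuple spaces, picks a tuple, and moves it if the destination already holds a larger fraction of similar tuples, which is the brood-sorting rule. The paper's partial observations and adaptive noise-tuple annealing (which prevent stalling in sub-optimal sorts) are omitted from this sketch, so it may converge only partially.

```python
import random

def agent_step(spaces):
    """One brood-sorting step over a list of tuple spaces."""
    src, dst = random.sample(range(len(spaces)), 2)
    if not spaces[src]:
        return
    t = random.choice(spaces[src])
    def frac(space):
        return sum(u == t for u in space) / len(space) if space else 0.0
    if frac(spaces[dst]) > frac(spaces[src]):
        spaces[src].remove(t)
        spaces[dst].append(t)

random.seed(0)
spaces = [[], [], []]
for kind in "abc":              # 15 tuples of each kind, scattered randomly
    for _ in range(15):
        random.choice(spaces).append(kind)
for _ in range(30000):
    agent_step(spaces)
print([max(set(s), key=s.count) if s else "-" for s in spaces])
```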
