Similar Documents
20 similar documents retrieved; search time: 390 ms.
1.
Many applications require the management of spatial data in a multidimensional feature space. Clustering large spatial databases is an important problem that tries to find the densely populated regions in the feature space for use in data mining, knowledge discovery, or efficient information retrieval. A good clustering approach should be efficient and detect clusters of arbitrary shape, and it must be insensitive to noise (outliers) and to the order of input data. We propose WaveCluster, a novel clustering approach based on wavelet transforms, which satisfies all the above requirements. Using the multiresolution property of wavelet transforms, we can effectively identify arbitrarily shaped clusters at different degrees of detail. We also demonstrate that WaveCluster is highly efficient in terms of time complexity. Experimental results on very large datasets are presented, showing the efficiency and effectiveness of the proposed approach compared to other recent clustering methods. Received June 9, 1998 / Accepted July 8, 1999
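A minimal sketch of the wavelet-clustering idea for 2-D data (not the authors' implementation): quantize the points onto a density grid, take one level of a 2-D wavelet transform, threshold the smoothed approximation band, and read clusters off as connected components of dense cells. The grid resolution, wavelet, and density threshold are illustrative knobs; mapping points back to their labeled cells is omitted.

```python
import numpy as np
import pywt
from scipy import ndimage

def wavecluster_2d(points, bins=64, wavelet="db2", density_thresh=1.0):
    # 1. Quantize the feature space into a density grid.
    grid, xe, ye = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    # 2. One level of the 2-D wavelet transform; the approximation
    #    band is a smoothed, lower-resolution view of the density.
    approx, _ = pywt.dwt2(grid, wavelet)
    # 3. Keep dense cells and label connected components as clusters.
    dense = approx > density_thresh
    labels, n_clusters = ndimage.label(dense)
    return labels, n_clusters
```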

2.
Distance-based outliers: algorithms and applications
This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers can only deal efficiently with two dimensions/attributes of a dataset. In this paper, we study the notion of DB (distance-based) outliers. Specifically, we show that (i) outlier detection can be done efficiently for large datasets, and for k-dimensional datasets with large values of k (e.g., k ≥ 5); and (ii) outlier detection is a meaningful and important knowledge discovery task. First, we present two simple algorithms, both having a complexity of O(kN²), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexity that is linear with respect to N, but exponential with respect to k. We provide experimental results indicating that this algorithm significantly outperforms the two simple algorithms for k ≤ 4. Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most three passes over a dataset. Again, experimental results show that this algorithm is by far the best for k ≤ 4. Finally, we discuss our work on three real-life applications, including one on spatio-temporal data (e.g., a video surveillance application), in order to confirm the relevance and broad applicability of DB outliers. Received February 15, 1999 / Accepted August 1, 1999
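A sketch of the simple O(kN²) test implied by the DB(p, D) definition: an object is a distance-based outlier if at least a fraction p of the dataset lies farther than distance D from it. The choice of p and D is up to the analyst.

```python
import numpy as np

def db_outliers(X, p=0.95, D=1.0):
    n = len(X)
    outliers = []
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)   # O(kN) per object
        far = np.count_nonzero(dists > D)
        if far >= p * n:                           # enough points lie beyond D
            outliers.append(i)
    return outliers
```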

3.
Exploring process data
With the growth of computer usage at all levels in the process industries, the volume of available data has also grown enormously, sometimes to levels that render analysis difficult. Most of this data may be characterized as historical in the sense that it was not collected on the basis of experiments designed to test specific statistical hypotheses. Consequently, the resulting datasets are likely to contain unexpected features (e.g., outliers from various sources, unsuspected correlations between variables, etc.). This observation is important for two reasons: First, these data anomalies can completely negate the results obtained by standard analysis procedures, particularly those based on squared-error criteria (a large class that includes many SPC and chemometrics techniques). Second, and sometimes more importantly, an understanding of these data anomalies may lead to extremely valuable insights. For both of these reasons, it is important to approach the analysis of large historical datasets with the initial objective of uncovering and understanding their gross structure and character. This paper presents a brief survey of some simple procedures that have been found to be particularly useful at this preliminary stage of analysis.

4.
Success of anomaly detection, similar to other spatial data mining techniques, relies on neighborhood definition. In this paper, we argue that the anomalous behavior of spatial objects in a neighborhood can be truly captured when both (a) spatial autocorrelation (similar behavior of nearby objects due to proximity) and (b) spatial heterogeneity (distinct behavior of nearby objects due to difference in the underlying processes in the region) are taken into consideration for the neighborhood definition. Our approach begins by generating micro neighborhoods around spatial objects encompassing all the information about a spatial object. We selectively merge these based on spatial relationships accounting for autocorrelation and inferential relationships accounting for heterogeneity, forming macro neighborhoods. In such neighborhoods, we then identify (i) spatio-temporal outliers, where individual sensor readings are anomalous, (ii) spatial outliers, where the entire sensor is an anomaly, and (iii) spatio-temporally coalesced outliers, where a group of spatio-temporal outliers in the macro neighborhood are separated by a small time lag indicating the traversal of the anomaly. We demonstrate the effectiveness of our approach in neighborhood formation and anomaly detection with experimental results in (i) water monitoring and (ii) highway traffic monitoring sensor datasets. We also compare the results of our approach with an existing approach for spatial anomaly detection.

5.
Association rule mining is one of the most important techniques for intelligent system design and has been widely applied in a large number of real applications. However, classical mining algorithms cannot process very large databases in a reasonable amount of time. The sampling approach that processes a subset of the whole database is a viable alternative. Obviously, such an approach cannot extract perfectly accurate rules. Previous works have tried to improve the accuracy by removing "outliers" from the initial sample based on global statistical properties in the sample. In this paper, we take the view that the initial sample may actually consist of multiple possibly overlapping subsets or clusters. It is more reasonable to apply data clustering techniques to the initial sample before outlier removal is performed on the resulting clusters, so that outliers are removed based on local properties of individual clusters. However, clustering transactional data with very high dimensions is a difficult problem by itself. We solve this problem by interpreting locality sensitive hashing as a means for data clustering. Previously proposed algorithms may then optionally be used to remove the outliers in the individual clusters. We propose several concrete algorithms based on this general strategy. Using an extensive set of synthetic data and real datasets, we evaluate our proposed algorithms and find that our proposals exhibit better accuracy or execution time, or both, than previously proposed algorithms.
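An illustrative sketch, under the assumption of minhash-banding LSH, of how hashing can group similar transactions into (possibly overlapping) candidate clusters before cluster-local outlier removal; the paper's concrete algorithms differ, and all parameters here are hypothetical.

```python
import random
from collections import defaultdict

def minhash_clusters(transactions, n_hashes=16, bands=4, seed=0):
    rng = random.Random(seed)
    prime = 2_147_483_647
    # One hash function per signature row: h(x) = (a*x + b) mod prime.
    hashes = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(n_hashes)]
    rows = n_hashes // bands
    buckets = defaultdict(list)
    for tid, items in enumerate(transactions):
        sig = [min((a * hash(it) + b) % prime for it in items)
               for a, b in hashes]
        # Banding: transactions sharing any band of the signature
        # fall into the same bucket (candidate cluster).
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(tid)
    return [tids for tids in buckets.values() if len(tids) > 1]
```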

6.
A fuzzy index for detecting spatiotemporal outliers
The detection of spatial outliers helps extract important and valuable information from large spatial datasets. Most of the existing work in outlier detection views the condition of being an outlier as a binary property. However, for many scenarios, it is more meaningful to assign a degree of being an outlier to each object. The temporal dimension should also be taken into consideration. In this paper, we formally introduce a new notion of spatial outliers. We discuss the spatiotemporal outlier detection problem, and we design a methodology to discover these outliers effectively. We introduce a new index called the fuzzy outlier index, FoI, which expresses the degree to which a spatial object belongs to a spatiotemporal neighbourhood. The proposed outlier detection method can be applied to phenomena evolving over time, such as moving objects, pedestrian modelling or credit card fraud.

7.
To cleanse mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-oriented scheme: the training dataset is separated into two parts (a major set and a minor set). The classifiers learned from the major set are used to identify noise in the minor set. The obvious drawbacks of such a scheme are twofold: (1) when the underlying data volume keeps growing, it would be either physically impossible or time consuming to load the major set into memory for inductive learning; and (2) for multiple or distributed datasets, it can be either technically infeasible or factitiously forbidden to download data from other sites (for security or privacy reasons). Therefore, these approaches have severe limitations in conducting effective global data cleansing from large, distributed datasets. In this paper, we propose a solution to bridge the local and global analysis for noise cleansing. More specifically, the proposed effort tries to identify and eliminate mislabeled data items from large or distributed datasets through local analysis and global incorporation. For this purpose, we make use of distributed datasets or partition a large dataset into subsets, each of which is regarded as a local subset and is small enough to be processed by an induction algorithm at one time to construct a local model for noise identification. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables are used to count the number of times it has been identified as noise by all data subsets. An instance with higher error values has a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach. A preliminary version of this paper was published in the Proceedings of the 20th International Conference on Machine Learning, Washington D.C., USA, 2003, pp. 920–927.
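A sketch of the partition-and-vote scheme described above, with a decision tree standing in for the authors' good-rule sets: each local subset votes on every instance, and instances flagged by a majority of subsets (or, more conservatively, by all subsets under non-objection) are eliminated.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cleanse(X, y, n_subsets=5, scheme="majority", seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_subsets)
    errors = np.zeros(len(X), dtype=int)           # per-instance error counts
    for part in parts:
        clf = DecisionTreeClassifier(random_state=seed).fit(X[part], y[part])
        errors += (clf.predict(X) != y)            # each local model votes on all data
    need = (n_subsets // 2 + 1) if scheme == "majority" else n_subsets
    keep = errors < need                           # drop instances flagged often enough
    return X[keep], y[keep]
```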

8.
We investigate algebraic processing strategies for large numeric datasets equipped with a (possibly irregular) grid structure. Such datasets arise, for example, in computational simulations, observation networks, medical imaging, and 2-D and 3-D rendering. Existing approaches for manipulating these datasets are incomplete: The performance of SQL queries for manipulating large numeric datasets is not competitive with specialized tools. Database extensions for processing multidimensional discrete data can only model regular, rectilinear grids. Visualization software libraries are designed to process arbitrary gridded datasets efficiently, but no algebra has been developed to simplify their use and afford optimization. Further, these libraries are data dependent – physical changes to data representation or organization break user programs. In this paper, we present an algebra of gridfields for manipulating arbitrary gridded datasets, algebraic optimization techniques, and an implementation backed by experimental results. We compare our techniques to those of Geographic Information Systems (GIS) and visualization software libraries, using real examples from an Environmental Observation and Forecasting System. We find that our approach can express optimized plans inaccessible to other techniques, resulting in improved performance with reduced programming effort.

9.
胡云, 王崇骏, 谢俊元, 吴骏, 周作建. 《软件学报》 (Journal of Software), 2013, 24(11): 2710–2720
Community evolution patterns in time-series datasets are an important topic in the study and application of network behavioral dynamics. Outlier detection based on community evolution can not only reveal novel anomalous behavior patterns but also support a more accurate understanding of how communities evolve. Using changes in members' community affiliations, this paper proposes the concept of a community-evolution transition matrix and studies several properties of the matrix and its relationship to the evolution of community structure. A robust-regression M-estimator is applied to further optimize the transition matrix and reduce the interference of anomalous points, while evolutionary outliers are characterized and formally defined. Because complex networks contain large numbers of randomly drifting marginal individuals, the defined outliers jointly consider a member's change of role within a community and its deviation from the community's overall transition pattern. The evolutionary outlier detection algorithm proposed on this basis adapts to various community evolution trends and focuses more effectively on the anomalous evolutionary behavior of important members in large-scale social networks. Experimental results show that the proposed method discovers important outlying evolution patterns in large-scale social network evolution sequences, and that these patterns admit reasonable real-world explanations.
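A sketch of the transition-matrix construction described above: entry M[i, j] counts members who belong to community i at time t and community j at time t+1, and rows can be normalized into empirical transition probabilities; the paper's robust M-estimation refinement and outlier scoring are omitted.

```python
import numpy as np

def transition_matrix(comm_t, comm_t1, n_communities):
    # comm_t / comm_t1: dicts mapping member id -> community index
    # at consecutive snapshots of the network.
    M = np.zeros((n_communities, n_communities))
    for member, ci in comm_t.items():
        if member in comm_t1:
            M[ci, comm_t1[member]] += 1
    # Row-normalize to get empirical transition probabilities;
    # members whose moves deviate strongly from their row's pattern
    # are candidate evolutionary outliers.
    rows = M.sum(axis=1, keepdims=True)
    P = np.divide(M, rows, out=np.zeros_like(M), where=rows > 0)
    return M, P
```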

10.
The K nearest neighbors approach is a viable technique in time series analysis when dealing with ill-conditioned and possibly chaotic processes. Such problems are frequently encountered in, e.g., finance and production economics. More often than not, the observed processes are distorted by nonnormal disturbances, incomplete measurements, etc. that undermine the identification, estimation and performance of multivariate techniques. If outliers can be duly recognized, many crisp statistical techniques may perform adequately as such. Geno-mathematical programming provides a connection between statistical time series theory and fuzzy regression models that may be utilized, e.g., in the detection of outliers. In this paper we propose a fuzzy distance measure for detecting outliers via geno-mathematical parametrization. Fuzzy KNN is connected as a linkable library to the genetic hybrid algorithm (GHA) of the author, in order to facilitate the determination of the LR-type fuzzy number for automatic outlier detection in time series data. We demonstrate that GHA[Fuzzy KNN] provides a platform for automatically detecting outliers in both simulated and real world data.
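For intuition only: a plain (crisp) kNN-distance outlier score, standing in for the paper's fuzzy distance measure and genetic parameter search, scores each point by its average distance to its k nearest neighbours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)          # column 0 is the point itself
    return dists[:, 1:].mean(axis=1)     # high score = likely outlier
```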

11.
Finding clusters in large datasets is a difficult task. Almost all computationally feasible methods are related to k-means and need a clear partition structure of the data, while most such datasets contain masking outliers and other deviations from the usual models of partitioning cluster analysis. It is possible to look for clusters informally using graphic tools like the grand tour, but the meaning and the validity of such patterns is unclear. In this paper, a three-step-approach is suggested: In the first step, data visualization methods like the grand tour are used to find cluster candidate subsets of the data. In the second step, reproducible clusters are generated from them by means of fixed point clustering, a method to find a single cluster at a time based on the Mahalanobis distance. In the third step, the validity of the clusters is assessed by the use of classification plots. The approach is applied to an astronomical dataset of spectra from the Hamburg/ESO survey.
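A rough sketch of the second step under stated assumptions: fixed point clustering grows a single cluster by repeatedly refitting a mean and covariance and keeping the points within a chi-square Mahalanobis cutoff until the member set stops changing. The starting subset and quantile are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def fixed_point_cluster(X, start_idx, q=0.99, max_iter=100):
    members = np.zeros(len(X), dtype=bool)
    members[start_idx] = True                      # candidate subset from step one
    cutoff = chi2.ppf(q, df=X.shape[1])
    for _ in range(max_iter):
        mu = X[members].mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(X[members], rowvar=False))
        # Squared Mahalanobis distance of every point to the current fit.
        diff = X - mu
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
        new_members = d2 <= cutoff
        if np.array_equal(new_members, members):   # fixed point reached
            break
        members = new_members
    return members
```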

12.
A novel pruning approach using expert knowledge for data-specific pruning
Classification is an important data mining task that discovers hidden knowledge from labeled datasets. Most approaches to pruning assume that all datasets are uniform and equally important, so they apply equal pruning to all of them. However, in real-world classification problems, datasets are not all equal, and applying an equal pruning rate tends to generate a decision tree with large size and a high misclassification rate. We approach the problem by first investigating the properties of each dataset and then deriving a data-specific pruning value using expert knowledge, which is used to design pruning techniques that prune decision trees close to perfection. An efficient pruning algorithm dubbed EKBP is proposed; it is very general, as we are free to use any learning algorithm as the base classifier. We have implemented our proposed solution and experimentally verified its effectiveness on forty real-world benchmark datasets from the UCI machine learning repository. In all these experiments, the proposed approach dramatically reduces tree size while enhancing or retaining the level of accuracy.

13.
Narayanan, Arvind; Verma, Saurabh; Zhang, Zhi-Li. World Wide Web, 2019, 22(6): 2771–2798

We coin the term geoMobile data to emphasize datasets that exhibit geo-spatial features reflective of human behaviors. We propose and develop an EPIC framework to mine latent patterns from geoMobile data and provide meaningful interpretations: we first ‘E’xtract latent features from high dimensional geoMobile datasets via Laplacian Eigenmaps and perform clustering in this latent feature space; we then use a state-of-the-art visualization technique to ‘P’roject these latent features into 2D space; and finally we obtain meaningful ‘I’nterpretations by ‘C’ulling cluster-specific significant feature sets. We illustrate that the local space contraction property of our approach is superior to that of other major dimension reduction techniques. Using diverse real-world geoMobile datasets, we show the efficacy of our framework via three case studies.
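A compressed sketch of the E-P-C steps as described, using off-the-shelf components: scikit-learn's SpectralEmbedding implements Laplacian eigenmaps, t-SNE stands in for the unnamed visualization technique, and the culling step is omitted. Component and parameter choices are illustrative.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding, TSNE
from sklearn.cluster import KMeans

def epic_sketch(X, n_latent=8, n_clusters=5, seed=0):
    # 'E'xtract latent features via Laplacian eigenmaps, then cluster there.
    latent = SpectralEmbedding(n_components=n_latent,
                               random_state=seed).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(latent)
    # 'P'roject the latent features into 2-D for visual inspection.
    xy = TSNE(n_components=2, random_state=seed).fit_transform(latent)
    return labels, xy
```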


14.
Estimating the selectivity of multidimensional range queries over real valued attributes has significant applications in data exploration and database query optimization. In this paper, we consider the following problem: given a table of d attributes whose domain is the real numbers and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. The simplest approach to tackle this problem is to assume that the attributes are independent. More accurate estimators try to capture the joint data distribution of the attributes. In databases, such estimators include the construction of multidimensional histograms, random sampling, or the wavelet transform. In statistics, kernel estimation techniques are being used. Many traditional approaches assume that attribute values come from discrete, finite domains, where different values have high frequencies. However, for many novel applications (as in temporal, spatial, and multimedia databases) attribute values come from the infinite domain of real numbers. Consequently, each value appears very infrequently, a characteristic that affects the behavior and effectiveness of the estimator. Moreover, real-life data exhibit attribute correlations that also affect the estimator. We present a new histogram technique that is designed to approximate the density of multidimensional datasets with real attributes. Our technique defines buckets of variable size and allows the buckets to overlap. The size of the cells is based on the local density of the data. The use of overlapping buckets allows a more compact approximation of the data distribution. We also show how to generalize kernel density estimators and how to apply them to the multidimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with existing techniques using real and synthetic datasets. The experimental results show that the proposed techniques behave more accurately in high dimensionalities than previous approaches. Received: 30 January 2001 / Accepted: 9 June 2003 / Published online: 4 March 2004. Edited by Y. Ioannidis. Dimitrios Gunopulos: supported by NSF ITR-0220148, NSF IIS-9907477 CAREER Award, NSF IIS-9984729, and NRDRP. George Kollios: supported by NSF IIS-0133825 CAREER Award. Vassilis J. Tsotras: supported by NSF IIS-9907477 and the US Dept. of Defense.
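A small sketch of kernel-based selectivity estimation for a range query, assuming a Gaussian KDE and a Monte Carlo estimate of the query-box mass; the paper's variable-size overlapping buckets are not reproduced here, and the sample size is an accuracy/cost knob.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_selectivity(data, low, high, n_samples=100_000, seed=0):
    # data: (d, n) array of attribute values; low/high: (d,) query box bounds.
    kde = gaussian_kde(data)
    s = kde.resample(n_samples, seed=seed)               # (d, n_samples)
    inside = np.all((s >= low[:, None]) & (s <= high[:, None]), axis=0)
    return inside.mean()                                 # estimated selectivity
```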

15.
Statistical outlier detection using direct density ratio estimation
We propose a new statistical approach to the problem of inlier-based outlier detection, i.e., finding outliers in the test set based on the training set consisting only of inliers. Our key idea is to use the ratio of training and test data densities as an outlier score. This approach is expected to have better performance even in high-dimensional problems since methods for directly estimating the density ratio without going through density estimation are available. Among various density ratio estimation methods, we employ the method called unconstrained least-squares importance fitting (uLSIF) since it is equipped with natural cross-validation procedures, allowing us to objectively optimize the value of tuning parameters such as the regularization parameter and the kernel width. Furthermore, uLSIF offers a closed-form solution as well as a closed-form formula for the leave-one-out error, so it is computationally very efficient and is scalable to massive datasets. Simulations with benchmark and real-world datasets illustrate the usefulness of the proposed approach.
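A compact numpy sketch of uLSIF-style estimation as described: model the ratio w(x) ≈ p_train(x)/p_test(x) with Gaussian basis functions and solve the regularized least-squares problem in closed form; a small ratio on a test point flags an outlier. The kernel width and regularization parameter would normally be chosen by the built-in cross-validation, not fixed as here.

```python
import numpy as np

def ulsif_scores(X_train, X_test, n_basis=100, sigma=1.0, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    centers = X_test[rng.choice(len(X_test), min(n_basis, len(X_test)),
                                replace=False)]

    def phi(X):  # Gaussian basis functions centred on test samples
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    Phi_te, Phi_tr = phi(X_test), phi(X_train)
    H = Phi_te.T @ Phi_te / len(X_test)      # second moments under test density
    h = Phi_tr.mean(axis=0)                  # first moments under training density
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)   # closed form
    return Phi_te @ alpha                    # small score = likely outlier
```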

16.
Outlier detection is an imperative field of data mining with several applications in medical research. Mining outliers based on the notion of rare patterns can be a promising solution for medical diagnosis, as it attempts to identify the unconventional and abnormal risk patterns present in medical data. A crucial issue in medical data analysis is the continuous growth of medical databases as new records are added. Existing outlier detection techniques can handle only static data and thus must re-execute from scratch to identify the outliers in incremental medical data. This paper introduces an efficient rare pattern based outlier detection (RPOD) method that identifies outliers by mining rare patterns from incremental data. To avoid the multiple database scans and expensive candidate generation steps performed by existing rare pattern mining techniques, and to facilitate incremental mining, a single-pass prefix-tree-based rare pattern mining technique is proposed. The proposed rare pattern mining technique is a modification of the well-known FP-Growth frequent pattern mining algorithm. Furthermore, an outlier detection technique is presented that identifies outliers based on the set of generated rare patterns. The significance of the proposed RPOD approach is demonstrated using several well-known medical datasets, and comparative performance evaluation substantiates its predominance over existing outlier mining methods.
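A toy sketch of the rare-pattern intuition, assuming patterns up to size two and a hypothetical support cutoff: records containing many low-support patterns score as outliers. RPOD's single-pass prefix-tree mining and incremental updates are not reproduced here.

```python
from itertools import combinations
from collections import Counter

def rare_pattern_scores(transactions, max_support=2):
    # Count support of single items and item pairs in one pass.
    support = Counter()
    for items in transactions:
        for it in items:
            support[(it,)] += 1
        for pair in combinations(sorted(items), 2):
            support[pair] += 1
    rare = {p for p, c in support.items() if c <= max_support}
    # Score each record by how many rare patterns it contains.
    scores = []
    for items in transactions:
        pats = [(it,) for it in items] + list(combinations(sorted(items), 2))
        scores.append(sum(p in rare for p in pats))   # higher = more outlying
    return scores
```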

17.
One of the main characteristics of the Internet era is the free and online availability of extremely large collections of images located on distributed and heterogeneous platforms over the web. The proliferation of millions of shared photographs spurred the emergence of new image retrieval techniques based not only on images' visual information, but also on geo-location tags and camera exif data. These huge visual collections provide a unique opportunity for cultural heritage documentation and 3D reconstruction. The main difficulty, however, is that internet image datasets are unstructured and contain many outliers. For this reason, this paper proposes a new content-based image filtering to discard image outliers that either confuse or significantly delay the subsequent e-documentation tools, such as 3D reconstruction of a cultural heritage object. The presented approach exploits and fuses two unsupervised clustering techniques: DBSCAN and spectral clustering. The DBSCAN algorithm removes outliers from the initially retrieved dataset, and spectral clustering partitions the noise-free image dataset into categories, each representing characteristic geometric views of cultural heritage objects. To discard the image outliers, we consider images as points on a multi-dimensional manifold, and the multi-dimensional scaling algorithm is adopted to relate the space of image distances with the space of Gram matrices, through which we compute the image coordinates. Finally, structure from motion is utilized for 3D reconstruction of cultural heritage landmarks. Evaluation on a dataset of about 31,000 cultural heritage images retrieved from internet collections with many outliers indicates that the proposed method is more robust and cost-effective toward a reliable and just-in-time 3D reconstruction than existing state-of-the-art techniques.
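A sketch of the two-stage filtering on generic image feature vectors: DBSCAN discards outliers as noise, then spectral clustering partitions the survivors into characteristic view groups. The eps, min_samples, and number-of-views values are placeholders.

```python
import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering

def filter_and_group(features, eps=0.5, min_samples=5, n_views=8, seed=0):
    noise_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    keep = noise_labels != -1                     # DBSCAN marks outliers with -1
    views = SpectralClustering(n_clusters=n_views,
                               random_state=seed).fit_predict(features[keep])
    return keep, views                            # inlier mask + view-group labels
```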

18.
Detecting and tracking regional outliers in meteorological data
Detecting spatial outliers can help identify significant anomalies in spatial data sequences. In the field of meteorological data processing, spatial outliers are frequently associated with natural disasters such as tornadoes and hurricanes. Previous studies on spatial outliers mainly focused on identifying single location points over a static data frame. In this paper, we propose and implement a systematic methodology to detect and track regional outliers in a sequence of meteorological data frames. First, a wavelet transformation such as the Mexican Hat or Morlet is used to filter noise and enhance the data variation. Second, an image segmentation method, λ-connected segmentation, is employed to identify the outlier regions. Finally, a regression technique is applied to track the center movement of the outlying regions for consecutive frames. In addition, we conducted experimental evaluations using real-world meteorological data and events such as Hurricane Isabel to demonstrate the effectiveness of our proposed approach.
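A sketch of the three stages under simplifying assumptions: a Laplacian-of-Gaussian filter (the Mexican hat) enhances variation, thresholding plus component labelling stands in for λ-connected segmentation, and a linear fit to region centroids tracks movement across frames.

```python
import numpy as np
from scipy import ndimage

def track_regional_outliers(frames, sigma=2.0, thresh=2.0):
    centroids = []
    for frame in frames:
        enhanced = -ndimage.gaussian_laplace(frame, sigma=sigma)  # Mexican hat
        mask = enhanced > thresh * enhanced.std()
        labels, n = ndimage.label(mask)           # segment outlying regions
        if n:                                     # track the largest region per frame
            sizes = ndimage.sum(mask, labels, range(1, n + 1))
            centroids.append(ndimage.center_of_mass(
                mask, labels, int(np.argmax(sizes)) + 1))
    c = np.array(centroids)
    t = np.arange(len(c))
    # Linear regression on centroid coordinates gives drift per frame.
    velocity = np.polyfit(t, c, 1)[0] if len(c) > 1 else None
    return c, velocity
```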

19.
Uncertain data management, querying, and mining have become important because the majority of real-world data is accompanied by uncertainty these days. Uncertainty in data is often caused by deficiencies in the underlying data-collection equipment, or is sometimes introduced deliberately to preserve data privacy. This work discusses the problem of distance-based outlier detection on uncertain datasets with Gaussian distributions. The naive approach to distance-based outlier detection on uncertain data is usually infeasible due to the expensive distance function, so a cell-based approach is proposed in this work to quickly identify the outliers. The unbounded support of the Gaussian distribution prevents effective pruning techniques, so an approximate approach using a bounded Gaussian distribution is also proposed. Approximating the Gaussian distribution by a bounded Gaussian distribution enables an approximate but more efficient cell-based outlier detection approach. An extensive empirical study on synthetic and real datasets shows that our proposed approaches are effective, efficient, and scalable.

20.
Density-based clustering techniques like DBSCAN are attractive because they can find arbitrarily shaped clusters along with noisy outliers. DBSCAN's time requirement is O(n²), where n is the size of the dataset, which makes it unsuitable for large datasets. The solution proposed in the paper is to first apply the leaders clustering method to derive prototypes, called leaders, from the dataset in a way that preserves density information, and then to use these leaders to derive the density-based clusters. The proposed hybrid clustering technique, called rough-DBSCAN, has a time complexity of only O(n) and is analyzed using rough set theory. Experimental studies using both synthetic and real-world datasets compare rough-DBSCAN with DBSCAN. It is shown that for large datasets rough-DBSCAN finds a clustering similar to that found by DBSCAN, but is consistently faster than DBSCAN. Some properties of the leaders as prototypes are also formally established.
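A sketch of the leaders step that gives rough-DBSCAN its speed: a single scan assigns each point to the nearest existing leader within a distance τ or promotes it to a new leader, and DBSCAN then runs on the much smaller leader set with follower counts as density weights. The naive nearest-leader search shown here is for clarity, and the distance thresholds are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def leaders(X, tau):
    L, counts = [], []
    for x in X:                                   # single pass over the data
        if L:
            d = np.linalg.norm(np.asarray(L) - x, axis=1)
            j = int(np.argmin(d))
            if d[j] <= tau:                       # join the nearest leader
                counts[j] += 1
                continue
        L.append(x)                               # otherwise become a new leader
        counts.append(1)
    return np.asarray(L), np.asarray(counts)

def rough_dbscan(X, tau=0.5, eps=1.0, min_samples=5):
    L, counts = leaders(X, tau)
    # sample_weight lets follower counts contribute to the density test.
    return L, DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        L, sample_weight=counts)
```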
