Similar Literature
20 similar documents found (search time: 328 ms)
1.
Dimensionality reduction is a very important tool in data mining. Intrinsic dimension of data sets is a key parameter for dimensionality reduction. However, finding the correct intrinsic dimension is a challenging task. In this paper, a new intrinsic dimension estimation method is presented. The estimator is derived by finding the exponential relationship between the radius of an incising ball and the number of samples included in the ball. The method is compared with the previous dimension estimation methods. Experiments have been conducted on synthetic and high dimensional image data sets and on data sets of the Santa Fe time series competition, and the results show that the new method is accurate and robust.
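The abstract does not give the estimator's closed form; a minimal correlation-dimension-style sketch of the radius-versus-neighbor-count idea (data set, radii, and all names here are illustrative, not the paper's implementation):

```python
import numpy as np

def intrinsic_dimension(points, radii):
    """Estimate intrinsic dimension as the slope of log N(r) versus log r,
    where N(r) is the average number of neighbors inside a ball of radius r."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    counts = np.array([((dist < r).sum(axis=1) - 1).mean() for r in radii],
                      dtype=float)  # subtract 1 to exclude the point itself
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
# A 2-D plane embedded in 5-D space: the estimate should be close to 2
plane = rng.uniform(size=(1000, 2))
data = np.hstack([plane, np.zeros((1000, 3))])
d_hat = intrinsic_dimension(data, radii=[0.05, 0.1, 0.2])
```

For a d-dimensional manifold the neighbor count grows like N(r) ∝ r^d, so the log-log slope recovers d; boundary effects bias the estimate slightly downward.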

2.
Determining the Intrinsic Dimensionality of Hyperspectral Images with a Mixture of Probabilistic PCA Models
Pu Xin. Computer Engineering, 2007, 33(9): 204-206
Effective dimensionality reduction is one of the difficulties modern imaging spectrometers face in identifying land-cover classes. Assuming the number of land-cover classes in a hyperspectral image is known, this paper proposes a method for determining the intrinsic dimensionality of hyperspectral images using a mixture minimum description length (MMDL) model selection criterion. Within the expectation-maximization framework, the method performs mixture-PPCA dimensionality reduction and clustering simultaneously, and determines the reduced dimensionality according to the MMDL criterion, yielding an accurate reduced-dimensional representation of the data in a probabilistic sense. Comparative experiments on simulated and real data show that the method selects the intrinsic dimensionality of the data accurately.

3.
Ruan L, Yuan M, Zou H. Neural Computation, 2011, 23(6): 1605-1622
Finite gaussian mixture models are widely used in statistics thanks to their great flexibility. However, parameter estimation for gaussian mixture models with high dimensionality can be challenging because of the large number of parameters that need to be estimated. In this letter, we propose a penalized likelihood estimator to address this difficulty. The [Formula: see text]-type penalty we impose on the inverse covariance matrices encourages sparsity on its entries and therefore helps to reduce the effective dimensionality of the problem. We show that the proposed estimate can be efficiently computed using an expectation-maximization algorithm. To illustrate the practical merits of the proposed method, we consider its applications in model-based clustering and mixture discriminant analysis. Numerical experiments with both simulated and real data show that the new method is a valuable tool for high-dimensional data analysis.

4.
Information retrieval today is much more challenging than traditional small document retrieval. The main difference is the importance of correlations between related concepts in complex data structures. As collections of data grow and contain more entries, they require more complex relationships, links, and groupings between individual entries. This paper introduces two novel methods for estimating data intrinsic dimensionality based on the singular value decomposition (SVD). The average standard estimator (ASE) and the multi-criteria decision weighted model are used to estimate matrix intrinsic dimensionality for large document collections. The multi-criteria weighted model calculates the sum of weighted values of matrix dimensions which demonstrated best performance using all possible dimensions [1]. ASE estimates the level of significance for singular values that resulted from the singular value decomposition. ASE assumes that those variables with deep relations have sufficient correlation and that only those relationships with high singular values are significant and should be maintained [1]. Experimental results indicate that ASE improves precision and relative relevance for MEDLINE document collection by 10.2% and 12.9% respectively compared to the percentage of variance dimensionality estimation. Results based on testing three document collections over all possible dimensions using selected performance measures indicate that ASE improved matrix intrinsic dimensionality estimation by including the effect of both singular values magnitude of decrease and random noise distracters. The multi-criteria weighted model with dimensionality reduction provides a more efficient implementation for information retrieval than using a full rank model.
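The paper's exact ASE formula is not reproduced in the abstract; one plausible reading, treating the mean singular value as the significance cutoff, can be sketched as follows (the cutoff rule and the synthetic matrix are assumptions for illustration):

```python
import numpy as np

def average_standard_estimator(matrix):
    """Count singular values above the average singular value and take
    that count as the estimated intrinsic dimensionality of the matrix."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int((s > s.mean()).sum())

rng = np.random.default_rng(1)
# Rank-3 term-document-like matrix plus small noise: estimate should be 3
low_rank = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 100))
noisy = low_rank + 0.01 * rng.normal(size=(200, 100))
k = average_standard_estimator(noisy)
```

The mean acts as a crude noise floor: the few large singular values of the signal dominate the average, while the many small noise values fall below it.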

5.
Low-Dimensional Embedding of High-Dimensional Data Manifolds and the Embedding Dimensionality
Discovering meaningful low-dimensional embeddings of manifolds in high-dimensional data spaces is a classic, difficult problem. Isomap is an effective nonlinear dimensionality reduction method based on manifold theory: it not only reveals the intrinsic structure of high-dimensional observations but also discovers the underlying low-dimensional parameter space. Isomap's theoretical foundation is the assumption that an isometric mapping exists between the high-dimensional data space and the low-dimensional parameter space, but this assumption was never proved. This paper first proves the existence of an isometric mapping between the continuous manifold of high-dimensional data and the low-dimensional parameter space. It then distinguishes the embedding-space dimension, the intrinsic dimension of the high-dimensional data, and the manifold dimension, and proves that for high-dimensional data containing ring-shaped manifolds the parameter-space dimension is smaller than the embedding-space dimension. Finally, it proposes a ring-manifold detection algorithm that determines whether a high-dimensional data space contains a ring-shaped manifold and then estimates its intrinsic dimension and the dimension of the underlying parameter space. Experiments on multi-pose 3D objects demonstrate the effectiveness of the algorithm and recover the correct low-dimensional parameter space.

6.
In the past decade the development of automatic intrinsic dimensionality estimators has gained considerable attention due to its relevance in several application fields. However, most of the proposed solutions prove not to be robust on noisy datasets, and provide unreliable results when the intrinsic dimensionality of the input dataset is high and the manifold where the points are assumed to lie is nonlinearly embedded in a higher dimensional space. In this paper we propose a novel intrinsic dimensionality estimator (DANCo) and its faster variant (FastDANCo), which exploit the information conveyed both by the normalized nearest neighbor distances and by the angles computed on couples of neighboring points. The effectiveness and robustness of the proposed algorithms are assessed by experiments on synthetic and real datasets, by the comparative evaluation with state-of-the-art methodologies, and by significance tests.
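DANCo itself is not reproduced here, but its nearest-neighbor-distance ingredient is closely related to the Levina-Bickel maximum-likelihood estimator, which can be sketched in a few lines (the test data set and parameter choices are illustrative assumptions):

```python
import numpy as np

def mle_intrinsic_dim(points, k=10):
    """Levina-Bickel maximum-likelihood intrinsic-dimension estimate from
    ratios of k-nearest-neighbor distances (the distance information that
    DANCo-style estimators also exploit)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dist.sort(axis=1)
    knn = dist[:, 1:k + 1]          # column 0 is the point itself
    # Per-point estimate: (k-1) / sum_{j<k} log(T_k / T_j)
    local = (k - 1) / np.log(knn[:, -1:] / knn[:, :-1]).sum(axis=1)
    return local.mean()

rng = np.random.default_rng(2)
# A 3-D Gaussian cloud embedded in 10-D space: estimate should be near 3
cloud = np.hstack([rng.normal(size=(800, 3)), np.zeros((800, 7))])
d_hat = mle_intrinsic_dim(cloud, k=10)
```

DANCo additionally models the angles between neighboring points, which is what gives it robustness at high intrinsic dimension; the distance-only estimator above degrades there.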

7.
To reduce the curse of dimensionality arising from nonparametric estimation procedures for multiple nonparametric regression, in this paper we suggest a simulation-based two-stage estimator. We first introduce a simulation-based method to decompose the multiple nonparametric regression into two parts. The first part can be estimated with the parametric convergence rate and the second part is small enough so that it can be approximated by orthogonal basis functions with a small trade-off parameter. Then the linear combination of the first and second step estimators results in a two-stage estimator for the multiple regression function. Our method does not need any specified structural assumption on the regression function and it is proved that the newly proposed estimator is always consistent even if the trade-off parameter is designed to be small. Thus when a common nonparametric estimator such as local linear smoothing collapses because of the curse of dimensionality, our estimator still works well.

8.
Subspace Semi-Supervised Fisher Discriminant Analysis
Yang Wuyi, Liang Wei, Xin Le, Zhang Shuwu. Acta Automatica Sinica, 2009, 35(12): 1513-1519
Fisher discriminant analysis, a popular supervised feature dimensionality reduction method, seeks a subspace that maximizes the ratio of between-class scatter to within-class scatter of the sample data. Labeling samples with class information usually requires substantial manual effort, time, and cost. To exploit both labeled and unlabeled samples when searching for the reduced subspace, we propose a subspace semi-supervised Fisher discriminant analysis method. It seeks a subspace that preserves both the class-discriminative structure learned from the labeled samples and the sample-structure information learned from the labeled and unlabeled samples together. We also derive a kernel-based version of the method. Face recognition experiments verify the effectiveness of the proposed algorithm.

9.
An evaluation of intrinsic dimensionality estimators
The intrinsic dimensionality of a data set may be useful for understanding the properties of classifiers applied to it and thereby for the selection of an optimal classifier. In this paper the authors compare the algorithms for two estimators of the intrinsic dimensionality of a given data set and extend their capabilities. One algorithm is based on the local eigenvalues of the covariance matrix in several small regions in the feature space. The other estimates the intrinsic dimensionality from the distribution of the distances from an arbitrary data vector to a selection of its neighbors. The characteristics of the two estimators are investigated and the results are compared. It is found that both can be applied successfully, but that they might fail in certain cases. The estimators are compared and illustrated using data generated from chromosome banding profiles
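The local-eigenvalue idea can be sketched directly: run PCA in small neighborhoods and count the eigenvalues needed to explain most of the local variance (the region count, region size, and 95% threshold below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def local_pca_dim(points, n_regions=20, region_size=40, var_threshold=0.95):
    """Average, over small random regions, the number of local covariance
    eigenvalues needed to explain var_threshold of the local variance."""
    rng = np.random.default_rng(0)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    dims = []
    for c in rng.choice(len(points), size=n_regions, replace=False):
        region = points[np.argsort(dist[c])[:region_size]]
        eigvals = np.sort(np.linalg.eigvalsh(np.cov(region, rowvar=False)))[::-1]
        explained = np.cumsum(eigvals) / eigvals.sum()
        dims.append(np.searchsorted(explained, var_threshold) + 1)
    return float(np.mean(dims))

rng = np.random.default_rng(2)
# A noisy 2-D sheet embedded isometrically in 4-D space
basis, _ = np.linalg.qr(rng.normal(size=(4, 2)))   # orthonormal 2-D basis
sheet = rng.uniform(size=(500, 2))
data = sheet @ basis.T + 0.001 * rng.normal(size=(500, 4))
d_hat = local_pca_dim(data)
```

Working in small regions is what lets the method follow a curved manifold: globally the sheet occupies all four coordinates, but each neighborhood is nearly flat and two eigenvalues dominate.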

10.
An important goal in cluster analysis is the internal validation of results using an objective criterion. Of particular relevance in this respect is the estimation of the optimum number of clusters capturing the intrinsic structure of the data. This paper proposes a method to determine this optimum number based on the evaluation of fuzzy partition stability under bootstrap resampling. The method is first characterized on synthetic data with respect to hyper-parameters, like the fuzzifier, and spatial clustering parameters, such as feature space dimensionality, degree of cluster overlap, and number of clusters. The method is then validated on experimental datasets. Furthermore, the performance of the proposed method is compared to that obtained using a number of traditional fuzzy validity rules based on the cluster compactness-to-separation criteria. The proposed method provides accurate and reliable results, and offers better generalization capabilities than the classical approaches.

11.
In this paper we present a new patch-based empirical Bayesian video denoising algorithm. The method builds a Bayesian model for each group of similar space-time patches. These patches are not motion-compensated, and therefore avoid the risk of inaccuracies caused by motion estimation errors. The high dimensionality of spatiotemporal patches together with a limited number of available samples poses challenges when estimating the statistics needed for an empirical Bayesian method. We therefore assume that groups of similar patches have a low intrinsic dimensionality, leading to a spiked covariance model. Based on theoretical results about the estimation of spiked covariance matrices, we propose estimators of the eigenvalues of the a priori covariance in high-dimensional spaces as simple corrections of the eigenvalues of the sample covariance matrix. We demonstrate empirically that these estimators lead to better empirical Wiener filters. A comparison on classic benchmark videos demonstrates improved visual quality and an increased PSNR with respect to state-of-the-art video denoising methods.
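The paper's specific eigenvalue corrections are not given in the abstract; the standard asymptotic debiasing for the spiked model (unit noise variance, aspect ratio gamma = dimension / sample count) is one such correction and inverts the relation lam = l + gamma*l/(l - 1) between a population spike l and its sample eigenvalue lam. A sketch, with a self-consistency check rather than real video data:

```python
import numpy as np

def debias_spiked_eigenvalue(lam, gamma):
    """Recover the population spike l from a sample eigenvalue lam under
    the spiked covariance model with unit noise variance, by solving
    lam = l + gamma * l / (l - 1) for l (taking the larger root)."""
    b = 1.0 + lam - gamma
    return (b + np.sqrt(b * b - 4.0 * lam)) / 2.0

# Forward-map a known spike, then invert it: should recover l_true exactly
l_true, gamma = 5.0, 0.5
lam = l_true + gamma * l_true / (l_true - 1.0)   # inflated sample eigenvalue
l_hat = debias_spiked_eigenvalue(lam, gamma)
```

The correction always shrinks large sample eigenvalues toward the population value, which is what makes the resulting empirical Wiener filter less aggressive than one built on raw sample eigenvalues.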

12.
Fisher discriminant analysis gives unsatisfactory results if points in the same class exhibit within-class multimodality, and it fails to produce non-negative projection vectors. In this paper, we focus on the newly formulated within and between-class scatters based supervised locality preserving dimensionality reduction problem and propose an effective dimensionality reduction algorithm, namely, Multiplicative Updates based non-negative Discriminative Learning (MUNDL), which optimally seeks to obtain two non-negative embedding transformations with high preservation and discrimination powers for two data sets in different classes such that nearby sample pairs in the original space compact in the learned embedding space, under which the projections of the original data in different classes can be appropriately separated from each other. We also show that MUNDL can be easily extended to nonlinear dimensionality reduction scenarios by employing the standard kernel trick. We verify the feasibility and effectiveness of MUNDL by conducting extensive data visualization and classification experiments. Numerical results on some benchmark UCI and real-world datasets show that the MUNDL method tends to capture the intrinsic local and multimodal structure characteristics of the given data and outperforms some established dimensionality reduction methods, while being much more efficient.
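MUNDL's own update rules are not given in the abstract, but the multiplicative-update family it belongs to is exemplified by the classic Lee-Seung rules for non-negative matrix factorization, which preserve non-negativity by construction (the matrix sizes and iteration count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
V = rng.uniform(size=(30, 20))        # non-negative data matrix
r = 5                                  # reduced dimensionality
W = rng.uniform(size=(30, r)) + 0.1    # non-negative factors
H = rng.uniform(size=(r, 20)) + 0.1

err_before = np.linalg.norm(V - W @ H)
for _ in range(200):
    # Lee-Seung multiplicative updates: each factor is rescaled element-wise
    # by a ratio of non-negative terms, so W and H can never go negative
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)
err_after = np.linalg.norm(V - W @ H)
```

This is the design choice the paper exploits for its projection vectors: additive gradient steps would need an explicit clipping step to stay non-negative, whereas multiplicative updates enforce the constraint for free while monotonically decreasing the reconstruction error.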

13.
A new method for analyzing the intrinsic dimensionality (ID) of low-dimensional manifolds in high-dimensional feature spaces is presented. Compared to a previous approach by Fukunaga and Olsen (1971), the method has only linear instead of cubic time complexity with respect to the dimensionality of the input space. Moreover, it is less sensitive to noise than the former approach. Experiments include ID estimation of synthetic data for comparison and illustration as well as ID estimation of an image sequence

14.
Dimensionality reduction is a great challenge in high dimensional unlabelled data processing. Existing dimensionality reduction methods typically employ a similarity matrix and a spectral clustering algorithm. However, the noises in original data always make the similarity matrix unreliable and degrade the clustering performance. Besides, existing spectral clustering methods just focus on the local structures and ignore the global discriminative information, which may lead to overfitting in some cases. To address these issues, a novel unsupervised 2-dimensional dimensionality reduction method is proposed in this paper, which incorporates the similarity matrix learning and global discriminant information into the procedure of dimensionality reduction. Particularly, the number of connected components in the learned similarity matrix is equal to the cluster number. We compare the proposed method with several 2-dimensional unsupervised dimensionality reduction methods and evaluate the clustering performance by K-means on several benchmark data sets. The experimental results show that the proposed method outperforms the state-of-the-art methods.

15.
Nonlinear Dimensionality Reduction for Ring-Structured Manifold Data
Meng Deyu, Gu Nannan, Xu Zongben, Liang Yi. Journal of Software, 2008, 19(11): 2908-2920
Many new nonlinear dimensionality reduction methods have emerged in recent years and perform well in some applications. However, they often fail on nonlinear manifold data generated by ring-structured manifolds such as spheres and cylinders. To address this problem, this paper proposes a ring-structure detection algorithm and a nonlinear dimensionality reduction method for ring-structured manifold data. Theoretically, based on the working principle of the widely studied Isomap method, a necessary and sufficient condition for identifying a ring manifold is given. Algorithmically, this theorem is used to construct a data-driven ring-manifold detection procedure. Finally, based on the detected ring structure, a nonlinear dimensionality reduction strategy using polar-coordinate unrolling is designed for ring-structured manifold data. Simulation results on a series of typical ring-manifold data sets show that, compared with other manifold learning methods, the proposed method has a significant advantage in reducing the dimensionality of ring-structured data.

16.
It is not uncommon to encounter a randomized clinical trial (RCT), in which we need to account for both the noncompliance of patients to their assigned treatment and confounders to avoid making a misleading inference. In this paper, we focus our attention on estimation of the relative treatment efficacy measured by the odds ratio (OR) in large strata for a stratified RCT with noncompliance. We have developed five asymptotic interval estimators for the OR. We employ Monte Carlo simulation to evaluate the finite-sample performance of these interval estimators in a variety of situations. We note that the interval estimator using the weighted least squares (WLS) method may perform well when the number of strata is small, but tends to be liberal when the number of strata is large. We find that the interval estimator using weights which are not functions of unknown parameters required to be estimated from data can improve the accuracy of the interval estimator based on the WLS method, but lose precision. We note that the estimator using the logarithmic transformation of the WLS point estimator and the interval estimator using the logarithmic transformation of the Mantel-Haenszel (MH) type of point estimator can perform well with respect to both the coverage probability and the average length in all the situations considered here. We further note that the interval estimator derived from a quadratic equation using a randomization-based method can be of use when the number of strata is large. Finally, we use the data taken from a multiple risk factor intervention trial to illustrate the use of interval estimators appropriate when the number of strata is small or moderate.
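The paper's five interval estimators are not reproduced in the abstract; a log-transformed MH-type interval of the kind it mentions can be sketched with the standard Robins-Breslow-Greenland variance (the variance choice and the two strata below are illustrative assumptions, not the paper's data):

```python
import numpy as np

def mh_odds_ratio_ci(tables, z=1.96):
    """Mantel-Haenszel pooled odds ratio over 2x2 strata (a, b, c, d),
    with a 95% log-scale CI using the Robins-Breslow-Greenland variance."""
    a, b, c, d = np.array(tables, dtype=float).T
    n = a + b + c + d
    R, S = a * d / n, b * c / n          # per-stratum MH numerator/denominator
    P, Q = (a + d) / n, (b + c) / n
    or_mh = R.sum() / S.sum()
    var_log = ((P * R).sum() / (2 * R.sum() ** 2)
               + (P * S + Q * R).sum() / (2 * R.sum() * S.sum())
               + (Q * S).sum() / (2 * S.sum() ** 2))
    half = z * np.sqrt(var_log)
    return or_mh, or_mh * np.exp(-half), or_mh * np.exp(half)

# Two hypothetical strata, each as (a, b, c, d)
or_mh, lo, hi = mh_odds_ratio_ci([(10, 5, 4, 12), (8, 6, 3, 9)])
```

Working on the log scale keeps the interval inside (0, inf) and makes it roughly symmetric around log(OR), which is why the log-transformed estimators perform well on coverage.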

17.
Spatial queries in high-dimensional spaces have been studied extensively. Among them, nearest neighbor queries are important in many settings, including spatial databases (Find the k closest cities) and multimedia databases (Find the k most similar images). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious “curse of dimensionality”. We show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly disobey these assumptions; rather, they typically are skewed and exhibit intrinsic (“fractal”) dimensionalities that are much lower than their embedding dimension, e.g. due to subtle dependencies between attributes. We show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets
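A box-counting estimate of the Hausdorff-type fractal dimension mentioned above can be sketched in a few lines: count occupied grid cells at several scales and fit the log-log slope (the data set and grid sizes are illustrative assumptions):

```python
import numpy as np

def box_counting_dim(points, sizes):
    """Slope of log(occupied boxes) versus log(1/box size): a simple
    box-counting estimate of the fractal dimension of a point set."""
    counts = [len(np.unique(np.floor(points / s), axis=0)) for s in sizes]
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

rng = np.random.default_rng(4)
# Points on a line segment embedded in the plane: dimension should be near 1,
# even though the embedding dimensionality is 2
t = rng.uniform(size=(2000, 1))
line = np.hstack([t, t])
d_hat = box_counting_dim(line, sizes=[0.01, 0.02, 0.05, 0.1])
```

This is exactly the gap the paper exploits: the line fills only ~1/s of the ~1/s^2 available cells at scale s, so cost formulas driven by the fractal dimension (here ~1) are far more accurate than ones driven by the embedding dimension (2).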

18.
Considerable intellectual progress has been made toward the development of various semiparametric varying-coefficient models over the past ten to fifteen years. An important advantage of these models is that they avoid much of the curse of dimensionality problem as the nonparametric functions are restricted only to some variables. More recently, varying-coefficient methods have been applied to quantile regression modeling, but all previous studies assume that the data are fully observed. The main purpose of this paper is to develop a varying-coefficient approach to the estimation of regression quantiles under random data censoring. We use a weighted inverse probability approach to account for censoring, and propose a majorize–minimize type algorithm to optimize the non-smooth objective function. The asymptotic properties of the proposed estimator of the nonparametric functions are studied, and a resampling method is developed for obtaining the estimator of the sampling variance. An important aspect of our method is that it allows the censoring time to depend on the covariates. Additionally, we show that this varying-coefficient procedure can be further improved when implemented within a composite quantile regression framework. Composite quantile regression has recently gained considerable attention due to its ability to combine information across different quantile functions. We assess the finite sample properties of the proposed procedures in simulated studies. A real data application is also considered.

19.
Li Dongrui, Xu Tongde. Journal of Computer Applications, 2012, 32(8): 2253-2257
Existing manifold-learning-based dimensionality reduction methods are sensitive to the choice of local neighborhood size, and the reduced low-dimensional data they produce are not well separable. This paper proposes a separability-preserving dimensionality reduction method with adaptive neighborhood selection. The method adaptively selects the neighborhood size of each sample point by estimating the intrinsic dimensionality of the data and the local tangent directions, and it uses clustering information from the mapped data to aggregate similar sample points, ensuring good separability after reduction and thus a better reduction result. Experimental results show that the new method obtains good embeddings on synthetic data sets and achieves the expected results in face visualization, classification, and image retrieval.

20.
Gene expression data are expected to be a significant aid in the development of efficient cancer diagnosis and classification platforms. However, gene expression data are high-dimensional and the number of samples is small in comparison to the dimensions of the data. Furthermore, the data are inherently noisy. Therefore, in order to improve the accuracy of the classifiers, we would be better off reducing the dimensionality of the data. As a method of dimensionality reduction, there are two previous proposals: feature selection and dimensionality reduction. Feature selection is a feedback method which incorporates the classifier algorithm in the feature selection process. Dimensionality reduction refers to algorithms and techniques which create new attributes as combinations of the original attributes in order to reduce the dimensionality of a data set. In this article, we compared the feature selection methods and the dimensionality reduction methods, and verified the effectiveness of both types. For the feature selection methods we used one previously known method and three proposed methods, and for the dimensionality reduction methods we used one previously known method and one proposed method. From an experiment using a benchmark data set, we confirmed the effectiveness of our proposed method with each type of dimensionality reduction method.
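The distinction the abstract draws between the two families can be made concrete in a few lines: feature selection keeps a subset of the original attributes, while dimensionality reduction (PCA here, as a stand-in for the paper's unnamed methods) builds new attributes as combinations (all data and parameters below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 50
X = rng.normal(size=(n, p))
X[:, 0] *= 5.0    # two features carry most of the variance,
X[:, 1] *= 3.0    # mimicking a few informative genes among many

# Feature selection: keep the k ORIGINAL features with highest variance,
# so the result stays interpretable in terms of individual genes
k = 2
selected = np.argsort(X.var(axis=0))[::-1][:k]

# Dimensionality reduction: PCA builds k NEW features as linear
# combinations of all original features
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ vt[:k].T
```

For diagnosis platforms the trade-off matters: selected features map back to individual genes, while PCA components mix all genes and are harder to interpret biologically.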
