Similar Documents
19 similar documents found (search time: 218 ms)
1.
Dimensionality reduction is one of the most common responses to the "curse of dimensionality" in high-dimensional data. It not only weakens the curse's negative effects but also removes redundant features, improving the efficiency of regression, classification, and other tasks on high-dimensional data. Such data typically exhibit complex or nonlinear structure, and an appropriate dimensionality reduction method can effectively project high-dimensional features into a low-dimensional space, extracting the nonlinear features of the original data. This paper applies a sparse autoencoder network, an unsupervised learning model, to extract nonlinear features from high-dimensional financial data, and feeds the extracted features into a supervised BP neural network to predict index returns. Furthermore, to verify the advantage and effectiveness of the sparse autoencoder for feature extraction, a sparse principal component model is introduced for comparison. The empirical analysis shows that the sparse autoencoder network extracts nonlinear features well, and its prediction accuracy exceeds that of linear dimensionality reduction methods represented by sparse principal components.
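As an illustration of the pipeline this abstract describes, the sketch below trains a minimal sparse autoencoder (one tanh hidden layer with an L1 activity penalty) on synthetic data and extracts low-dimensional nonlinear features. All data sizes and hyperparameters here are hypothetical choices; the paper's actual network and financial data set are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for high-dimensional data: 200 samples, 20 correlated features.
z = rng.normal(size=(200, 3))
X = np.tanh(z @ rng.normal(size=(3, 20))) + 0.05 * rng.normal(size=(200, 20))

n_in, n_hid = X.shape[1], 3
W1 = rng.normal(scale=0.1, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_in)); b2 = np.zeros(n_in)
lam, lr = 1e-3, 0.05            # L1 sparsity weight and learning rate

for _ in range(500):
    H = np.tanh(X @ W1 + b1)    # encoder: nonlinear hidden activations
    R = H @ W2 + b2             # linear decoder: reconstruction
    err = R - X
    # Gradients of 0.5*mean||R - X||^2 + lam*sum|H|
    gR = err / len(X)
    gW2 = H.T @ gR; gb2 = gR.sum(0)
    gH = gR @ W2.T + lam * np.sign(H)
    gZ = gH * (1 - H ** 2)      # derivative of tanh
    gW1 = X.T @ gZ; gb1 = gZ.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

features = np.tanh(X @ W1 + b1)  # low-dimensional nonlinear features
print(features.shape)            # (200, 3)
```

These features would then serve as inputs to a downstream predictor, in the spirit of the BP-network stage of the paper.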

2.
Linear Low-Rank Approximation and Nonlinear Dimensionality Reduction   Total citations: 4 (self: 0, others: 4)
This survey reviews recent work on both linear and nonlinear data reduction. For the linear case, it discusses the structural analysis of the singular value decomposition of column-partitioned matrices and methods and algorithms for sparse low-rank approximation; for the nonlinear case, it studies methods for nonlinear dimensionality reduction and manifold learning. All of these are research topics of great current interest in data mining and machine learning.
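The linear side of this survey, low-rank approximation via the SVD, can be sketched in a few lines. The matrix below is synthetic; truncating at the true rank recovers it up to the noise level, as the Eckart-Young theorem guarantees.

```python
import numpy as np

rng = np.random.default_rng(1)
# A 50x30 matrix of (numerical) rank 5, plus tiny noise.
A = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 30))
A += 1e-6 * rng.normal(size=A.shape)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] * s[:k] @ Vt[:k]   # best rank-k approximation (Eckart-Young)

rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(rel_err < 1e-4)             # True: error is at the noise level
```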

3.
This paper discusses basic problems in the correlation and cluster analysis of linear manifolds and multiple manifolds. Assuming that the high-dimensional data follow a mixture of subspaces, it analyzes the geometric structure of the original data, applying sparse subspace clustering (SSC) to the linear-manifold clustering problem and spectral multi-manifold clustering (SMMC) to the multi-manifold clustering problem. In addition, resampling the original data reduces dimensionality, extracts the spatial geometric features more effectively, and yields better clustering results.

4.
To address the severe air pollution in Beijing, Shanghai, and Guangzhou, a support vector machine model based on fractal manifold learning is built to predict the air pollution index. First, fractal theory is used to compute the fractal dimension of the air pollution data set; next, guided by the fractal dimension, manifold learning embeds the high-dimensional air pollution data into a low-dimensional space via a nonlinear mapping; finally, a support vector machine with a Gaussian kernel is built to predict the air pollution index of the three regions. Prediction results for Beijing, Shanghai, and Guangzhou show that the model outperforms traditional prediction models and has good stability and effectiveness.
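The first step of this pipeline, estimating a fractal dimension, can be illustrated with a simple box-counting estimator on synthetic points sampled from a curve (true dimension 1). The data and box sizes are hypothetical and unrelated to the paper's air-pollution sets.

```python
import numpy as np

rng = np.random.default_rng(2)
# Points on a smooth helix in 3-D: intrinsic (box-counting) dimension ~ 1.
t = rng.uniform(0, 1, 5000)
pts = np.c_[np.cos(4 * np.pi * t), np.sin(4 * np.pi * t), t]

def box_counting_dim(pts, sizes=(0.2, 0.1, 0.05, 0.025)):
    """Estimate dimension from the slope of log N(eps) vs log(1/eps)."""
    counts = []
    for eps in sizes:
        cells = np.floor(pts / eps).astype(int)   # grid cell of each point
        counts.append(len(np.unique(cells, axis=0)))
    slope, _ = np.polyfit(np.log(1 / np.array(sizes)), np.log(counts), 1)
    return slope

d = box_counting_dim(pts)
print(round(d, 2))   # close to 1 for a curve
```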

6.
Existing subspace clustering methods assume a globally linear data distribution, use prior constraints to estimate the low-dimensional subspaces of unlabeled data points, and cluster the points into the corresponding groups, which limits their ability to handle nonlinearly structured data. Inspired by the great success of deep learning, with its powerful capacity for nonlinear representation learning, this paper adds pairwise constraints to the data representation, applies manifold regularization theory, builds a global similarity matrix from k nearest neighbors, and, through joint learning with an autoencoder, proposes a deep semi-supervised spectral clustering algorithm based on manifold regularization and pairwise constraints (MPAE). While learning the low-dimensional representation, the algorithm preserves both the reconstructability of the data and the global character of the local manifold structure, and it incorporates the pairwise-constraint information of known samples into the optimization objective, making the learned low-dimensional features more discriminative and substantially improving clustering performance. Experimental results show that the algorithm achieves the desired clustering results.

7.
High-dimensional data sets often suffer from redundancy and the curse of dimensionality, so covering models built directly on them fail to adequately reflect the data distribution. This paper proposes an approximate-convex-hull covering model based on sparse dimensionality reduction. A homotopy algorithm first solves the l_1 optimization problem in sparse representation, automatically obtaining a reasonable number of neighbors through the sparsity constraint and constructing a graph; Locality Preserving Projections (LPP) then performs a locality-preserving projection, achieving fast and effective dimensionality reduction; finally, one-class classification is realized by constructing an approximate convex hull covering in the low-dimensional space. Experimental results on the UCI repository, the MNIST handwritten digit database, and the MIT-CBCL face recognition database confirm the effectiveness of the method; compared with existing one-class classification algorithms, the proposed covering model achieves higher classification accuracy.

8.
Clustering Methods Based on the Manifold Structure of Data and Their Applications   Total citations: 1 (self: 0, others: 1)
With the continuous development of the information society, mankind has entered an era of information explosion, and massive data make data processing cumbersome and complex. The key to this problem is how to reduce the dimensionality of existing high-dimensional data, cluster them, and to some extent remove the noise they contain. Based on the relevant theory, this work reduces dimensionality first and then clusters, classifies high-dimensional data into subspace-structured and manifold-structured cases, and models and solves them with sparse subspace clustering, spectral multi-manifold clustering, and K-manifolds. Comparison of the methods shows that spectral multi-manifold clustering runs fast, clusters accurately, and is the most generally applicable model.

9.
Manifold learning is a new nonlinear dimensionality reduction approach that has recently attracted great attention from researchers in visualization and other fields. To deepen the understanding of manifold learning, this paper introduces its basic principles, summarizes its research progress and taxonomies, and then describes the basic ideas, algorithmic steps, and respective strengths and weaknesses of several common manifold learning methods. Experiments on the synthetic Swiss-Roll data set compare the methods with respect to neighborhood-size selection and sensitivity to noise. The results show that, compared with traditional linear dimensionality reduction methods, manifold learning methods can effectively discover the low-dimensional structure of observed samples. Finally, future research directions for manifold learning are discussed, in the hope of further progress in this field.
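The Swiss-Roll experiment mentioned in this survey is easy to reproduce. The sketch below generates a standard Swiss-Roll and shows why a linear method struggles: all three PCA directions carry substantial variance even though the surface is intrinsically two-dimensional. The sample size and noise-free construction are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
t = 1.5 * np.pi * (1 + 2 * rng.uniform(size=n))   # roll angle
h = 20 * rng.uniform(size=n)                       # height along the roll
X = np.c_[t * np.cos(t), h, t * np.sin(t)]         # classic Swiss-Roll

# A linear method (PCA) sees three significant directions, although the
# manifold is intrinsically 2-D -- the gap that manifold learning exploits.
Xc = X - X.mean(0)
var = np.linalg.svd(Xc, compute_uv=False) ** 2
ratio = var / var.sum()
print(np.round(ratio, 2))
```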

10.
可定向的具非负曲率完备非紧黎曼流形   总被引:5,自引:0,他引:5  
詹华税 《数学进展》2001,30(1):70-74
本文研究了具非负曲率完备非紧黎曼流形的一些几何性质,包括闭测地线,体积等.证明了核心的余维数为奇数的可定向具非负曲率完备非紧黎曼流形在其核心的任一法测地线均为射线的条件下可等距分裂为R×N,其中N为低一维的流形.  相似文献   

11.
This paper deals with a method, called locally linear embedding (LLE). It is a nonlinear dimensionality reduction technique that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional data and attempts to discover a nonlinear structure (including manifolds) in high-dimensional data. In practice, the nonlinear manifold learning methods are applied in image processing, text mining, etc. The implementation of the LLE algorithm is fairly straightforward, because the algorithm has only two control parameters: the number of neighbors of each data point and the regularization parameter. The mapping quality is quite sensitive to these parameters. In this paper, we propose a new way of selecting a regularization parameter of a local Gram matrix.
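The local Gram matrix and its regularization discussed here can be sketched for a single point as follows. The neighborhood size k and the regularization scale 1e-3 times the trace are common illustrative defaults, not the selection rule this paper proposes.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
i = 0
# k nearest neighbours of point i (excluding the point itself).
d2 = ((X - X[i]) ** 2).sum(1)
nbrs = np.argsort(d2)[1:11]                  # k = 10

Z = X[nbrs] - X[i]                           # centred neighbours
G = Z @ Z.T                                  # local Gram matrix (k x k)
reg = 1e-3 * np.trace(G)                     # regularization of the Gram matrix
w = np.linalg.solve(G + reg * np.eye(len(G)), np.ones(len(G)))
w /= w.sum()                                 # LLE reconstruction weights sum to 1
print(round(w.sum(), 6))                     # 1.0
```

Without the regularization term, G is singular whenever k exceeds the data dimension, which is exactly why the choice of this parameter matters.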

12.
Nonlinear dimensionality reduction (NLDR) algorithms such as Isomap, LLE, and Laplacian Eigenmaps address the problem of representing high-dimensional nonlinear data in terms of low-dimensional coordinates which represent the intrinsic structure of the data. This paradigm incorporates the assumption that real-valued coordinates provide a rich enough class of functions to represent the data faithfully and efficiently. On the other hand, there are simple structures which challenge this assumption: the circle, for example, is one-dimensional, but its faithful representation requires two real coordinates. In this work, we present a strategy for constructing circle-valued functions on a statistical data set. We develop a machinery of persistent cohomology to identify candidates for significant circle-structures in the data, and we use harmonic smoothing and integration to obtain the circle-valued coordinate functions themselves. We suggest that this enriched class of coordinate functions permits a precise NLDR analysis of a broader range of realistic data sets.

13.
To address the poor clustering performance of the traditional DBSCAN algorithm on high-dimensional data sets and its sensitivity to parameter selection, a new improved DBSCAN algorithm based on similarity measures is proposed. The algorithm builds a similarity matrix between data points from geodesic distance and shared nearest neighbors, overcoming the limitations of Euclidean distance on high-dimensional data and better reflecting the true structure of the data set. The Eps and MinPts parameters are determined adaptively by analyzing the distribution characteristics of the data. Experimental results show that the proposed GS-DBSCAN algorithm effectively clusters data with complex distributions and achieves higher clustering accuracy on high-dimensional data than the compared algorithms, verifying its accuracy and feasibility.
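One ingredient of the similarity measure above, shared-nearest-neighbor (SNN) similarity, can be sketched on two toy blobs as follows. The geodesic-distance component and the adaptive Eps/MinPts selection of GS-DBSCAN are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two well-separated Gaussian blobs of 20 points each.
X = np.r_[rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))]

k = 8
d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
knn = np.argsort(d, axis=1)[:, 1:k + 1]      # k nearest neighbours of each point

n = len(X)
snn = np.zeros((n, n), dtype=int)
for a in range(n):
    for b in range(n):
        # SNN similarity = number of shared k-nearest neighbours.
        snn[a, b] = len(np.intersect1d(knn[a], knn[b]))

within = snn[:20, :20][~np.eye(20, dtype=bool)].mean()
across = snn[:20, 20:].mean()
print(within > across)   # True: same-cluster pairs share more neighbours
```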

14.
Data sets in high-dimensional spaces are often concentrated near low-dimensional sets. Geometric Multi-Resolution Analysis (Allard, Chen, Maggioni, 2012) was introduced as a method for approximating (in a robust, multiscale fashion) a low-dimensional set around which data may be concentrated, while also providing a dictionary for sparse representation of the data. Moreover, the procedure is very computationally efficient. We introduce an estimator, constructed from the GMRA approximations, for low-dimensional sets supporting the data. We exhibit (near optimal) finite sample bounds on its performance, and demonstrate the robustness of this estimator with respect to noise and model error. In particular, our results imply that, if the data is supported on a low-dimensional manifold, the proposed sparse representations result in an error which depends only on the intrinsic dimension of the manifold. (© 2014 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim)

15.
In practical data mining tasks, high-dimensional data has to be analyzed. In most cases it is very informative to map and visualize the hidden structure of a complex data set in a low-dimensional space. In this paper a new class of mapping algorithms is defined. These algorithms combine topology representing networks and different nonlinear mapping algorithms. While the former methods aim to quantize the data and disclose the real structure of the objects, the nonlinear mapping algorithms are able to visualize the quantized data in the low-dimensional vector space. In this paper, techniques based on these methods are gathered and the results of a detailed analysis performed on them are shown. The primary aim of this analysis is to examine the preservation of distances and neighborhood relations of the objects. Preservation of neighborhood relations was analyzed both in local and global environments. To evaluate the main properties of the examined methods we show the outcome of the analysis based both on synthetic and real benchmark examples.

16.
Existing one-class classification algorithms usually describe similarity between samples with the classical Euclidean metric, but the Euclidean metric cannot properly reflect the intrinsic distribution structure of some data sets, which limits the descriptive power of these methods. This paper proposes a distance metric learning algorithm for one-class data in high-dimensional space that improves the descriptive performance of one-class classifiers. Unlike existing distance metric learning algorithms, it requires only target-class data; by introducing a regularization term based on the prior sample distribution and an L1-norm sparsity constraint on the metric, it effectively solves the problem of learning a one-class distance metric in high-dimensional space from few samples, and it solves the resulting optimization efficiently with a block coordinate descent algorithm. The learned metric is easily embedded into one-class classifiers. Simulation results show that the learned metric effectively improves the descriptive performance of one-class classifiers, in particular SVDD, giving them stronger generalization ability.

17.
A popular approach for analyzing high-dimensional datasets is to perform dimensionality reduction by applying non-parametric affinity kernels. Usually, it is assumed that the represented affinities are related to an underlying low-dimensional manifold from which the data is sampled. This approach works under the assumption that, due to the low-dimensionality of the underlying manifold, the kernel has a low numerical rank. Essentially, this means that the kernel can be represented by a small set of numerically-significant eigenvalues and their corresponding eigenvectors. We present an upper bound for the numerical rank of Gaussian convolution operators, which are commonly used as kernels by spectral manifold-learning methods. The achieved bound is based on the underlying geometry that is provided by the manifold from which the dataset is assumed to be sampled. The bound can be used to determine the number of significant eigenvalues/eigenvectors that are needed for spectral analysis purposes. Furthermore, the results in this paper provide a relation between the underlying geometry of the manifold (or dataset) and the numerical rank of its Gaussian affinities. The term cover-based bound is used because the computations of this bound are done by using a finite set of small constant-volume boxes that cover the underlying manifold (or the dataset). We present bounds for finite Gaussian kernel matrices as well as for the continuous Gaussian convolution operator. We explore and demonstrate the relations between the bounds that are achieved for finite and continuous cases. The cover-oriented methodology is also used to provide a relation between the geodesic length of a curve and the numerical rank of the Gaussian kernel of datasets that are sampled from it.
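The geometry-to-rank relation described here can be demonstrated numerically: below, a Gaussian kernel on points sampled from a unit circle (a 1-D manifold) has numerical rank well below the matrix size. The scale eps and the relative threshold are illustrative choices; this is a numerical observation, not the paper's cover-based bound itself.

```python
import numpy as np

rng = np.random.default_rng(6)
# 300 points on a unit circle: a 1-D manifold embedded in the plane.
t = rng.uniform(0, 2 * np.pi, 300)
X = np.c_[np.cos(t), np.sin(t)]

eps = 0.5
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-d2 / eps)                          # Gaussian affinity kernel

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
num_rank = int((eigs > 1e-6 * eigs[0]).sum())  # numerically-significant spectrum
print(num_rank < K.shape[0])                   # True: rank is well below 300
```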

18.
This article proposes a modeling framework for high-dimensional experimental data, such as brain images or microarrays, that discovers statistically significant structures most relevant to the experimental covariates. To deal with the curse of dimensionality, three regularization schemes are used: a reduced-rank model, penalization of the covariance matrix, and regularization of the basis-expanded predictor set. The latter allows us to flexibly model associations while controlling for overfitting. The modeling framework is derived from a reduced-rank multiresponse linear model, which offers a familiar interface for researchers. The novel regularizations of both sides of the model make it applicable in high-dimensional settings, without a need for prior dimension reduction, and can model nonlinear relationships. An efficient, dual-space algorithm is proposed to estimate its components in low-dimensional space. It permits the use of the bootstrap, to provide pointwise standard error bands on association graphs, and other resampling techniques to optimize hyperparameters. We evaluate the model on a small neuroimaging dataset, and in a simulation study using simple images corrupted by additive iid Gaussian and random field noise components with signal-to-noise ratios below 0.1. Our model compares well with a general linear model (GLM) even when the nonlinear associations are specified explicitly in GLM.

19.
Customer Churn Prediction with an LDA Boosting Algorithm   Total citations: 2 (self: 1, others: 1)
This paper proposes an LDAboost (Linear Discriminant Analysis boost) classification method. The algorithm effectively exploits all sample features and extracts and combines, from the high-dimensional feature space, the most discriminative low-dimensional features, maximizing the ratio of between-class scatter to within-class scatter; it therefore avoids overfitting and greatly improves efficiency. The effectiveness of the algorithm is verified on a real customer churn data set from a commercial bank. Compared with other algorithms such as artificial neural networks, decision trees, and support vector machines, the method significantly improves accuracy; compared with other boosting algorithms, LDAboost also shows clear advantages.
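The core LDA step, finding the projection that maximizes between-class scatter relative to within-class scatter, can be sketched as follows. The two-class synthetic data and the midpoint threshold are illustrative; the boosting wrapper of the paper is omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two classes in 5-D, separated along one coordinate direction.
X0 = rng.normal(0, 1, (100, 5))
X1 = rng.normal(0, 1, (100, 5)); X1[:, 0] += 4

m0, m1 = X0.mean(0), X1.mean(0)
Sw = np.cov(X0.T) + np.cov(X1.T)          # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)          # Fisher discriminant direction
w /= np.linalg.norm(w)

# Project to 1-D and classify with a midpoint threshold.
thr = ((X0 @ w).mean() + (X1 @ w).mean()) / 2
acc = ((X0 @ w < thr).mean() + (X1 @ w > thr).mean()) / 2
print(round(acc, 2))
```

The learned direction concentrates the class separation in a single low-dimensional feature, which is the quantity a boosting wrapper would then reweight and combine.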


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司)
