Similar Literature
20 similar documents retrieved.
1.
Locally weighted regression is a technique that predicts the response for new data items from their neighbors in the training data set, where closer data items are assigned higher weights in the prediction. However, the original method may suffer from overfitting and fail to select the relevant variables. In this paper we propose combining a regularization approach with locally weighted regression to achieve sparse models. Specifically, the lasso is a shrinkage and selection method for linear regression. We present an algorithm that embeds the lasso in an iterative procedure that alternately computes weights and performs lasso-wise regression. The algorithm is tested on three synthetic scenarios and two real data sets. Results show that the proposed method outperforms linear and local models for several kinds of scenarios.
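The scheme the abstract describes, kernel weights plus a lasso fit, can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes Gaussian kernel weights and relies on scikit-learn's Lasso, whose fit method accepts sample_weight.

```python
import numpy as np
from sklearn.linear_model import Lasso

def local_lasso_predict(X, y, x0, tau=0.5, alpha=0.01):
    """Predict at x0 with a lasso fit weighted toward x0's neighbors."""
    # Gaussian kernel weights: closer training points count more.
    d2 = np.sum((X - x0) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))
    model = Lasso(alpha=alpha)
    model.fit(X, y, sample_weight=w)  # weighted (local) lasso fit
    return model.predict(x0.reshape(1, -1))[0]

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # only feature 0 matters
pred = local_lasso_predict(X, y, np.array([1.0, 0.0, 0.0]))
```

The lasso penalty lets the locally weighted fit drop the two irrelevant features, which plain locally weighted regression cannot do.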

2.
3.
Geographic information systems (GIS) organize spatial data in multiple two-dimensional arrays called layers. In many applications, a response of interest is observed on a set of sites in the landscape, and it is of interest to build a regression model from the GIS layers to predict the response at unsampled sites. Model selection in this context then consists not only of selecting appropriate layers, but also of choosing appropriate neighborhoods within those layers. We formalize this problem as a linear model and propose the use of Lasso to simultaneously select variables, choose neighborhoods, and estimate parameters. Spatially dependent errors are accounted for using generalized least squares and spatial smoothness in selected coefficients is incorporated through use of a priori spatial covariance structure. This leads to a modification of the Lasso procedure, called spatial Lasso. The spatial Lasso can be implemented by a fast algorithm and it performs well in numerical examples, including an application to prediction of soil moisture. The methodology is also extended to generalized linear models. Supplemental materials including R computer code and data analyzed in this article are available online.

4.
Clusterwise regression consists of finding a number of regression functions each approximating a subset of the data. In this paper, a new approach for solving clusterwise linear regression problems is proposed based on a nonsmooth nonconvex formulation. We present an algorithm for minimizing this nonsmooth nonconvex function. This algorithm incrementally divides the whole data set into groups which can be easily approximated by one linear regression function. A special procedure is introduced to generate a good starting point for solving global optimization problems at each iteration of the incremental algorithm. Such an approach allows one to find a global or near-global solution to the problem when the data sets are sufficiently dense. The algorithm is compared with the multistart Späth algorithm on several publicly available data sets for regression analysis.
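For intuition, the baseline the paper compares against, a multistart Späth-style heuristic that alternates point assignment and least-squares refitting, can be sketched as follows. This is not the paper's incremental nonsmooth-optimization algorithm; the data and settings are illustrative.

```python
import numpy as np

def clusterwise_linreg(X, y, k=2, n_starts=10, n_iter=100):
    """Clusterwise linear regression by alternating assignment and
    least-squares refitting (a Spath-style multistart heuristic)."""
    Xd = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    best = None
    for s in range(n_starts):
        rng = np.random.default_rng(s)
        coefs = rng.normal(size=(k, Xd.shape[1]))  # random starting lines
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iter):
            resid = (Xd @ coefs.T - y[:, None]) ** 2
            new = resid.argmin(axis=1)             # assign each point to its best line
            if np.array_equal(new, labels):
                break
            labels = new
            for j in range(k):                     # refit each cluster's line
                m = labels == j
                if m.sum() >= Xd.shape[1]:
                    coefs[j], *_ = np.linalg.lstsq(Xd[m], y[m], rcond=None)
        sse = resid.min(axis=1).sum()
        if best is None or sse < best[0]:          # keep the best of the multistarts
            best = (sse, coefs.copy(), labels.copy())
    return best[1], best[2]

# Two interleaved lines through the origin: y = 2x and y = -2x.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 300)
g = rng.integers(2, size=300)
y = np.where(g == 1, 2 * x, -2 * x) + rng.normal(scale=0.05, size=300)
coefs, labels = clusterwise_linreg(x.reshape(-1, 1), y, k=2)
```

The multistart is what guards against the bad local optima the paper's incremental starting-point procedure is designed to avoid.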

5.
Similarities in Fuzzy Regression Models
The solutions of a fuzzy regression model are obtained by converting the problem into a linear programming problem. For each level h ∈ [0, 1), there exists a solution. In this paper, we study the set of all the solutions to the fuzzy regression model that comes from a set of data as a metric space with an appropriate metric on it. We define a similarity ratio that allows us to compare the spaces of solutions of a fuzzy regression model that come from different sets of data. We also give an application using data sets concerning the GNP–money relationship.

6.
7.
In genetic studies of complex diseases, particularly mental illnesses and behavior disorders, two distinct characteristics have emerged in some data sets. First, genetic data sets are collected with a large number of phenotypes that are potentially related to the complex disease under study. Second, each phenotype is collected from the same subject repeatedly over time. In this study, we present a nonparametric regression approach to study multivariate and time-repeated phenotypes together by using the technique of multivariate adaptive regression splines for analysis of longitudinal data (MASAL), which makes it possible to identify genes as well as gene–gene and gene–environment (including time) interactions associated with the phenotypes of interest. Furthermore, we propose a permutation test to assess the associations between the phenotypes and selected markers. Through simulation, we demonstrate that our proposed approach has advantages over the existing methods that examine each longitudinal phenotype separately or analyze the summarized values of phenotypes by compressing them into one-time-point phenotypes. Application of the proposed method to the Framingham Heart Study illustrates that the use of multivariate longitudinal phenotypes enhanced the significance of the association test.

8.
Much work has focused on developing exact tests for the analysis of discrete data using log linear or logistic regression models. A parametric model is tested for a dataset by conditioning on the value of a sufficient statistic and determining the probability of obtaining another dataset as extreme or more extreme relative to the general model, where extremeness is determined by the value of a test statistic such as the chi-square or the log-likelihood ratio. Exact determination of these probabilities can be infeasible for high dimensional problems, and asymptotic approximations to them are often inaccurate when there are small data entries and/or there are many nuisance parameters. In these cases Monte Carlo methods can be used to estimate exact probabilities by randomly generating datasets (tables) that match the sufficient statistic of the original table. However, naive Monte Carlo methods produce tables that are usually far from matching the sufficient statistic. The Markov chain Monte Carlo method used in this work (the regression/attraction approach) uses attraction to concentrate the distribution around the set of tables that match the sufficient statistic, and uses regression to take advantage of information in tables that "almost" match. It is also more general than others in that it does not require the sufficient statistic to be linear, and it can be adapted to problems involving continuous variables. The method is applied to several high dimensional settings including four-way tables with a model of no four-way interaction, and a table of continuous data based on beta distributions. It is powerful enough to deal with the difficult problem of four-way tables and flexible enough to handle continuous data with a nonlinear sufficient statistic.
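The regression/attraction sampler itself is beyond a short sketch, but the core idea of Monte Carlo exact testing, generating tables that match the sufficient statistic, can be illustrated for a two-way independence model: permuting one categorical variable against the other preserves both one-way margins exactly. The data here are synthetic.

```python
import numpy as np

def chi2_stat(a, b, ka, kb):
    """Pearson chi-square statistic for two integer-coded categorical vectors."""
    obs = np.zeros((ka, kb))
    np.add.at(obs, (a, b), 1)                       # cross-tabulate
    exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / obs.sum()
    return np.sum((obs - exp) ** 2 / exp)

def mc_exact_pvalue(a, b, n_mc=2000, seed=0):
    """Monte Carlo exact test of independence: permuting b keeps both
    one-way margins (the sufficient statistic) fixed."""
    rng = np.random.default_rng(seed)
    ka, kb = a.max() + 1, b.max() + 1
    t0 = chi2_stat(a, b, ka, kb)
    hits = sum(chi2_stat(a, rng.permutation(b), ka, kb) >= t0
               for _ in range(n_mc))
    return (hits + 1) / (n_mc + 1)                  # add-one exact-test convention

rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=200)
b_ind = rng.integers(0, 2, size=200)  # independent of a
b_dep = a % 2                         # perfectly determined by a
p_ind = mc_exact_pvalue(a, b_ind)
p_dep = mc_exact_pvalue(a, b_dep)
```

For higher-dimensional or nonlinear sufficient statistics, naive permutation no longer works, which is exactly the gap the regression/attraction MCMC approach addresses.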

9.
The goals of this paper are twofold: we describe common features in data sets from motor vehicle insurance companies and we investigate a general strategy which exploits the knowledge of such features. The results of the strategy are a basis to develop insurance tariffs. We use a nonparametric approach based on a combination of kernel logistic regression and ε-support vector regression which both have good robustness properties. The strategy is applied to a data set from motor vehicle insurance companies.

10.
Currently, prenatal screening for Down Syndrome (DS) uses the mother's age as well as three biochemical markers for risk prediction. Risk calculations for the biochemical markers use a quadratic discriminant function. In this paper we compare several classification procedures to quadratic discrimination methods for biochemical-based DS risk prediction, based on data from a prospective multicentre prenatal screening study. We investigate alternative methods including linear discriminant methods, logistic regression methods, neural network methods, and classification and regression-tree methods. Several experiments are performed, and in each experiment resampling methods are used to create training and testing data sets. The procedures on the test data set are summarized by the area under their receiver operating characteristic curves. In each experiment this process is repeated 500 times and then the classification procedures are compared. We find that several methods are superior to the currently used quadratic discriminant method for risk estimation for these data. The implications of these results for prenatal screening programs are discussed.
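The resampling comparison the abstract describes can be sketched with scikit-learn; synthetic data stand in for the screening markers, and 20 repeats stand in for the paper's 500.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
models = {"LDA": LinearDiscriminantAnalysis(),
          "QDA": QuadraticDiscriminantAnalysis(),
          "logistic": LogisticRegression(max_iter=1000)}
aucs = {name: [] for name in models}
for rep in range(20):                       # repeated random train/test splits
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=rep)
    for name, m in models.items():
        m.fit(Xtr, ytr)
        p = m.predict_proba(Xte)[:, 1]
        aucs[name].append(roc_auc_score(yte, p))   # AUC on held-out data
mean_auc = {name: float(np.mean(v)) for name, v in aucs.items()}
```

Averaging AUC over many resampled splits, as the paper does, gives a far more stable ranking of classifiers than a single train/test split.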

11.
Active set algorithms for isotonic regression: A unifying framework
In this and subsequent papers we will show that several algorithms for the isotonic regression problem may be viewed as active set methods. The active set approach provides a unifying framework for studying algorithms for isotonic regression, simplifies the exposition of existing algorithms, and leads to several new efficient algorithms. We also investigate the computational complexity of several algorithms. In this paper we consider the isotonic regression problem with respect to a complete order, where each weight w_i is strictly positive and each y_i is an arbitrary real number. We show that the Pool Adjacent Violators algorithm (due to Ayer et al., 1955; Miles, 1959; Kruskal, 1964) is a dual feasible active set method and that the Minimum Lower Set algorithm (due to Brunk et al., 1957) is a primal feasible active set method of computational complexity O(n^2). We present a new O(n) primal feasible active set algorithm. Finally we discuss Van Eeden's method and show that it is of worst-case exponential time complexity. This work was supported by the National Science and Engineering Research Council of Canada under Research Grant A8189 and an Ontario Graduate Scholarship.
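The Pool Adjacent Violators algorithm mentioned above is short enough to sketch in full: scan the data once, merging adjacent blocks whenever their weighted means violate monotonicity.

```python
def pava(y, w=None):
    """Pool Adjacent Violators: weighted isotonic least-squares fit."""
    n = len(y)
    w = [1.0] * n if w is None else list(w)
    # Each active block stores (weighted mean, total weight, number of points).
    means, weights, sizes = [], [], []
    for yi, wi in zip(y, w):
        means.append(yi); weights.append(wi); sizes.append(1)
        # Merge backwards while the monotonicity constraint is violated.
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2, s2 = means.pop(), weights.pop(), sizes.pop()
            m1, w1, s1 = means.pop(), weights.pop(), sizes.pop()
            wtot = w1 + w2
            means.append((w1 * m1 + w2 * m2) / wtot)
            weights.append(wtot); sizes.append(s1 + s2)
    fit = []
    for m, s in zip(means, sizes):   # expand blocks back to point fits
        fit.extend([m] * s)
    return fit

print(pava([1, 3, 2, 4, 2, 5]))  # → [1, 2.5, 2.5, 3.0, 3.0, 5]
```

Each point is merged at most once overall, which is what makes linear-time implementations of this scheme possible.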

12.
The existence of outliers in a data set and how to deal with them is an important problem in statistics. The minimum volume ellipsoid (MVE) estimator is a robust estimator of location and covariate structure; however its use has been limited because there are few computationally attractive methods. Determining the MVE consists of two parts—finding the subset of points to be used in the estimate and finding the ellipsoid that covers this set. This article addresses the first problem. Our method will also allow us to compute the minimum covariance determinant (MCD) estimator. The proposed method of subset selection is called the effective independence distribution (EID) method, which chooses the subset by minimizing determinants of matrices containing the data. This method is deterministic, yielding reproducible estimates of location and scatter for a given data set. The EID method of finding the MVE is applied to several regression data sets where the true estimate is known. Results show that the EID method, when applied to these data sets, produces the subset of data more quickly than conventional procedures and that there is less than 6% relative error in the estimates. We also give timing results illustrating the feasibility of our method for larger data sets. For the case of 10,000 points in 10 dimensions, the compute time is under 25 minutes.
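scikit-learn's MinCovDet computes the MCD estimate via the FAST-MCD algorithm, not the EID method proposed here, but it illustrates what a robust location/scatter estimate buys over the classical one under contamination:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[:20] += 8.0                          # 10% of points shifted: gross outliers

mcd = MinCovDet(random_state=0).fit(X)     # robust location and scatter
emp = EmpiricalCovariance().fit(X)         # classical mean and covariance
# mcd.location_ stays near the bulk of the data at the origin,
# while emp.location_ is dragged toward the outlier cluster.
```

The MCD's subset-selection step plays the same role as the EID subset search: both look for the "clean" half of the data before estimating location and scatter.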

13.
In this paper, considering the special geometry of compositional data, two new methods for estimating missing values in compositional data are introduced. The first method uses the mean in the simplex space: it finds nearest neighbors according to the Aitchison distance and combines them with the two basic operations on the simplex, perturbation and powering. As a second proposal, a principal component regression imputation method is introduced, which starts from the result of the proposed simplex-mean method; it uses the ilr transformation to transform the compositional data set, and then applies principal component regression in the transformed space. The proposed methods are tested on real and simulated data sets, and the results show that they work well.
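The simplex geometry the paper relies on, the Aitchison distance and the perturbation and powering operations, can be sketched in a few lines (this illustrates the geometry only, not the imputation algorithms; the clr transform is used here because the Aitchison distance is the Euclidean distance between clr coordinates):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a composition."""
    return np.log(x) - np.mean(np.log(x))

def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between clr coordinates."""
    return np.linalg.norm(clr(x) - clr(y))

def perturb(x, p):
    """Perturbation: the simplex analogue of addition (re-closed to sum 1)."""
    z = x * p
    return z / z.sum()

def power(x, a):
    """Powering: the simplex analogue of scalar multiplication."""
    z = x ** a
    return z / z.sum()

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.3, 0.6])
d = aitchison_distance(x, y)
```

Note the scale invariance: multiplying a composition by a positive constant leaves the Aitchison distance unchanged, which is why ordinary Euclidean distance is inappropriate for this kind of data.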

14.
Shipping companies are forced by the current EU regulation to set up a system for monitoring, reporting, and verification of harmful emissions from their fleet. In this regulatory background, data collected from onboard sensors can be utilized to assess the ship's operating conditions and quantify its CO2 emission levels. The standard approach for analyzing such data sets is based on summarizing the measurements obtained during a given voyage by the average value. However, this compression step may lead to significant information loss since most variables present a dynamic profile that is not well approximated by the average value only. Therefore, in this work, we test two feature-oriented methods that are able to extract additional features, namely, profile-driven features (PdF) and statistical pattern analysis (SPA). A real data set from a Ro-Pax ship is then considered to test the selected methods. The data set is segregated according to the voyage distance into short, medium, and long routes. Both PdF and SPA are compared with the standard approach, and the results demonstrate the benefits of employing more systematic and informative feature-oriented methods. For the short route, no method is able to predict CO2 emissions in a satisfactory way, whereas for the medium and long routes, regression models built using features obtained from both PdF and SPA improve their prediction performance. In particular, for the long route, the standard approach failed to provide reasonably good predictions.
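PdF and SPA are the paper's own methods; as a generic illustration of the underlying idea, a voyage sensor profile can be compressed into several statistical features rather than the mean alone (the feature set and signal here are illustrative assumptions):

```python
import numpy as np
from scipy.stats import skew

def voyage_features(signal):
    """Summarize one onboard sensor profile with several statistics
    instead of the mean alone."""
    return {
        "mean": float(np.mean(signal)),
        "std": float(np.std(signal)),            # captures variability
        "skew": float(skew(signal)),             # captures asymmetry
        "p10": float(np.percentile(signal, 10)), # lower tail of operation
        "p90": float(np.percentile(signal, 90)), # upper tail of operation
    }

# A toy speed profile: slow drift plus measurement noise.
rng = np.random.default_rng(0)
speed = 14 + 3 * np.sin(np.linspace(0, 6, 500)) + rng.normal(0, 0.3, 500)
feats = voyage_features(speed)
```

Two voyages with the same mean speed but very different dynamics then map to different feature vectors, which is exactly the information the mean-only compression discards.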

15.
The Role of PLS Regression in Eliminating Multicollinearity
This paper describes in detail the harmful effects of multicollinearity among explanatory variables in regression modeling and analysis, and reviews several commonly used methods for removing the influence of multicollinearity, along with their shortcomings. Drawing on an empirical study, the paper shows that a new modeling approach, PLS regression, can better eliminate the impact of multicollinearity on the accuracy and reliability of the model.

16.
Building regression models for the causal relationships that objectively exist between phenomena is a common practice. In this paper, based on the principles of multivariate regression analysis and using observational data collected on the production floor, we build a multivariate multiple regression model relating two quality characteristics of a product to five key influencing factors. To demonstrate the feasibility of applying multivariate regression, we also use a worked example to present two forms of the estimate of the response vector, as well as confidence intervals for unconditional prediction.
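A multivariate multiple regression of this shape (two responses, five predictors) can be estimated with a single least-squares solve, since the coefficient matrix for all responses shares the same design matrix. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 5))                    # five influencing factors
B_true = rng.normal(size=(6, 2))               # intercept + 5 slopes, per response
Xd = np.column_stack([np.ones(n), X])          # add intercept column
Y = Xd @ B_true + rng.normal(scale=0.1, size=(n, 2))  # two quality characteristics

# One lstsq call fits both response columns simultaneously.
B_hat, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
```

Column j of B_hat is exactly the OLS fit of response j alone; the multivariate formulation matters when one moves on to joint inference, such as the confidence regions the paper presents.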

17.
In insurance (or in finance) practice, in a regression setting, there are cases where the error distribution is not normal and other cases where the set of data is contaminated due to outlier events. In such cases the classical credibility regression models lead to an unsatisfactory behavior of credibility estimators, and it is more appropriate to use quantile regression instead of ordinary least squares estimation. However, these quantile credibility models cannot perform effectively when the set of data has a nested (hierarchical) structure. This paper develops credibility models for regression quantiles with nested classification as an alternative to Norberg's (1986) approach of random coefficient regression with multi-stage nested classification. The paper illustrates two types of applications, one with insurance data and one with Fama/French financial data.

18.
We propose a two-component graphical chain model, the discrete regression distribution, where a set of discrete random variables is modeled as a response to a set of categorical and continuous covariates. The proposed model is useful for modeling a set of discrete variables measured at multiple sites along with a set of continuous and/or discrete covariates. The proposed model allows for joint examination of the dependence structure of the discrete response and observed covariates and also accommodates site-to-site variability. We develop the graphical model properties and theoretical justifications of this model. Our model has several advantages over the traditional logistic normal model used to analyze similar compositional data, including site-specific random effect terms and the incorporation of discrete and continuous covariates.

19.
Claims reserving is necessary for representing the future obligations of an insurance company, and selecting an accurate method is a major component of the overall claims reserving process. However, the wide range of unquantifiable factors which increase the uncertainty should be considered when using any method to estimate the amount of outstanding claims based on past data. Unlike traditional methods in claims analysis, fuzzy set approaches can tolerate imprecision and uncertainty without loss of performance and effectiveness. In this paper, hybrid fuzzy least-squares regression, proposed by Chang (2001), is used to predict future claim costs by utilizing the concept of a geometric separation method. We use probabilistic confidence limits for designing triangular fuzzy numbers. Thus, it allows us to reflect variability measures contained in a data set in the prediction of future claim costs. We also propose weighted functions of fuzzy numbers as a defuzzification procedure in order to transform estimated fuzzy claim costs into a crisp real equivalent.
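The two ingredients named above, triangular fuzzy numbers built from confidence limits and a weighted defuzzification, can be sketched as follows. The weight vector is an illustrative assumption, not Chang's or the authors' choice.

```python
def triangular_from_ci(center, lower, upper):
    """Build a triangular fuzzy number (l, m, r) from a point estimate
    and probabilistic confidence limits."""
    return (lower, center, upper)

def defuzzify(tfn, w=(0.25, 0.5, 0.25)):
    """Weighted defuzzification: collapse (l, m, r) to a crisp value.
    The weights here are illustrative, not from the paper."""
    l, m, r = tfn
    wl, wm, wr = w
    return (wl * l + wm * m + wr * r) / (wl + wm + wr)

# A fuzzy claim cost with an asymmetric confidence band.
claim = triangular_from_ci(center=100.0, lower=80.0, upper=130.0)
crisp = defuzzify(claim)  # → 102.5
```

Because the band is asymmetric, the crisp equivalent is pulled above the point estimate, which is how the variability in the data propagates into the final reserve figure.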

20.
Multiblock component methods are applied to data sets for which several blocks of variables are measured on the same set of observations, with the goal of analyzing the relationships between these blocks of variables. In this article, we focus on multiblock component methods that integrate the information found in several blocks of explanatory variables in order to describe and explain one set of dependent variables. In the following, multiblock PLS and multiblock redundancy analysis are chosen, as particular cases of multiblock component methods, when one set of variables is explained by a set of predictor variables that is organized into blocks. Because these multiblock techniques assume that the observations come from a homogeneous population, they will provide suboptimal results when the observations actually come from different populations. A strategy to palliate this problem, presented in this article, is to use a technique such as clusterwise regression to identify homogeneous clusters of observations. This approach creates two new methods that provide clusters with their own sets of regression coefficients. This combination of clustering and regression improves the overall quality of the prediction and facilitates the interpretation. In addition, the minimization of a well-defined criterion, by means of a sequential algorithm, ensures that the algorithm converges monotonously. Finally, the proposed method is distribution-free and can be used when the explanatory variables outnumber the observations within clusters. The proposed clusterwise multiblock methods are illustrated with a simulation study and a (simulated) example from marketing.
