Similar Documents
20 similar documents found
1.
Exact global optimization of the clusterwise regression problem is challenging and there are currently no published feasible methods for performing this clustering optimally, even though it has been over thirty years since its original proposal. This work explores global optimization of the clusterwise regression problem using mathematical programming and related issues. A mixed logical-quadratic programming formulation with implication of constraints is presented and contrasted against a quadratic formulation based on the traditional big-M, which cannot guarantee optimality because the regression line coefficients, and thus errors, may be arbitrarily large. Clusterwise regression optimization times and solution optimality for two clusters are empirically tested on twenty real datasets and three series of synthetic datasets ranging from twenty to one hundred observations and from two to ten independent variables. Additionally, a few small real datasets are clustered into three lines.
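To make concrete what exact global optimization means here, the following sketch brute-forces the optimal two-line partition on a tiny synthetic dataset by enumerating every cluster assignment and fitting ordinary least squares within each cluster. It is only a minimal illustration of the objective, not the mixed logical-quadratic programming formulation studied in the paper; the function name and the toy data are hypothetical.

```python
import itertools
import numpy as np

def exact_two_cluster_regression(X, y):
    """Brute-force global optimum of two-cluster clusterwise regression on a
    tiny dataset: enumerate every assignment, fit OLS per cluster, and keep
    the partition with the smallest total sum of squared errors."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])              # add an intercept column
    best_sse, best_labels = np.inf, None
    for labels in itertools.product([0, 1], repeat=n):
        labels = np.array(labels)
        # each cluster needs at least as many points as coefficients
        if min(labels.sum(), n - labels.sum()) < Xd.shape[1]:
            continue
        sse = 0.0
        for k in (0, 1):
            Xk, yk = Xd[labels == k], y[labels == k]
            beta, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
            sse += float(np.sum((yk - Xk @ beta) ** 2))
        if sse < best_sse:
            best_sse, best_labels = sse, labels
    return best_sse, best_labels

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(12, 1))
y = np.where(X[:, 0] > 0, 2 * X[:, 0] + 1, -3 * X[:, 0]) + 0.05 * rng.normal(size=12)
print(exact_two_cluster_regression(X, y))
```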

2.
We propose a functional extension of fuzzy clusterwise regression, which estimates fuzzy memberships of clusters and regression coefficient functions for each cluster simultaneously. The proposed method permits dependent and/or predictor variables to be functional, varying over time, space, and other continua. The fuzzy memberships and clusterwise regression coefficient functions are estimated by minimizing an objective function that adopts a basis function expansion approach to approximating functional data. An alternating least squares algorithm is developed to minimize the objective function. We conduct simulation studies to demonstrate the superior performance of the proposed method compared to its non-functional counterpart and to examine the performance of various cluster validity measures for selecting the optimal number of clusters. We apply the proposed method to real datasets to illustrate its empirical usefulness.

3.
Clusterwise regression consists of finding a number of regression functions, each approximating a subset of the data. In this paper, a new approach for solving the clusterwise linear regression problem is proposed based on a nonsmooth nonconvex formulation. We present an algorithm for minimizing this nonsmooth nonconvex function. This algorithm incrementally divides the whole data set into groups which can be easily approximated by one linear regression function. A special procedure is introduced to generate a good starting point for solving global optimization problems at each iteration of the incremental algorithm. Such an approach allows one to find a global or near-global solution to the problem when the data sets are sufficiently dense. The algorithm is compared with the multistart Späth algorithm on several publicly available data sets for regression analysis.
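The Späth-style multistart heuristic used as the comparison baseline can be sketched roughly as follows: alternate between assigning each observation to the regression line with the smallest squared residual and refitting each line by ordinary least squares, restarting from several random assignments. This is a simplified sketch under assumed conventions (random restarts, ad hoc re-seeding of degenerate clusters), not the incremental algorithm proposed in the paper.

```python
import numpy as np

def clusterwise_lr(X, y, k=2, n_starts=20, max_iter=100, seed=0):
    """Multistart alternating heuristic (Spaeth-style, simplified): assign
    every observation to the line with the smallest squared residual, refit
    OLS within each cluster, and repeat until the assignment stabilises."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    best_sse, best = np.inf, None
    for _ in range(n_starts):
        labels = rng.integers(k, size=n)
        for _ in range(max_iter):
            betas = []
            for j in range(k):
                idx = np.flatnonzero(labels == j)
                if len(idx) < Xd.shape[1]:             # degenerate cluster: re-seed it
                    idx = rng.choice(n, size=Xd.shape[1], replace=False)
                beta, *_ = np.linalg.lstsq(Xd[idx], y[idx], rcond=None)
                betas.append(beta)
            sq_resid = np.stack([(y - Xd @ b) ** 2 for b in betas], axis=1)
            new_labels = sq_resid.argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        sse = sq_resid[np.arange(n), labels].sum()
        if sse < best_sse:
            best_sse, best = sse, (labels.copy(), betas)
    return best_sse, best
```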

4.
Fuzzy clusterwise regression has been a useful method for investigating cluster-level heterogeneity of observations based on linear regression. This method integrates fuzzy clustering and ordinary least-squares regression, thereby enabling regression coefficients for each cluster and fuzzy cluster memberships of observations to be estimated simultaneously. In practice, however, fuzzy clusterwise regression may suffer from multicollinearity as it builds on ordinary least-squares regression. To deal with this problem in fuzzy clusterwise regression, a new method, called regularized fuzzy clusterwise ridge regression, is proposed that combines ridge regression with regularized fuzzy clustering in a unified framework. In the proposed method, ridge regression is adopted to estimate clusterwise regression coefficients while handling potential multicollinearity among predictor variables. In addition, regularized fuzzy clustering based on maximizing entropy is utilized to systematically determine an optimal degree of fuzziness in memberships. A simulation study is conducted to evaluate parameter recovery of the proposed method as compared to the extant non-regularized counterpart. The usefulness of the proposed method is illustrated by an application concerning the relationship among the characteristics of used cars.
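A rough sketch of the two ingredients, membership-weighted ridge fits and an entropy-based (softmax) membership update, is given below. It is a hypothetical simplification, not the paper's estimation procedure; the penalty `alpha`, the fuzziness parameter `lam`, and the update rules are assumptions chosen for illustration.

```python
import numpy as np

def fuzzy_clusterwise_ridge(X, y, k=2, alpha=1.0, lam=0.5, n_iter=50, seed=0):
    """Hypothetical sketch: alternate (1) membership-weighted ridge fits per
    cluster and (2) an entropy-regularised (softmax) membership update.
    alpha is the ridge penalty, lam controls the degree of fuzziness.
    For brevity the intercept is penalised along with the slopes."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    p = Xd.shape[1]
    U = rng.dirichlet(np.ones(k), size=n)              # n x k fuzzy memberships
    for _ in range(n_iter):
        betas = []
        for j in range(k):
            W = np.diag(U[:, j])
            A = Xd.T @ W @ Xd + alpha * np.eye(p)
            betas.append(np.linalg.solve(A, Xd.T @ W @ y))
        err = np.stack([(y - Xd @ b) ** 2 for b in betas], axis=1)
        # softmax of negative errors; subtract the row minimum for stability
        U = np.exp(-(err - err.min(axis=1, keepdims=True)) / lam)
        U /= U.sum(axis=1, keepdims=True)
    return U, betas
```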

5.
Several papers have already stressed the value of latent root regression and its similarities to partial least squares regression. A new formulation of this method is discussed that makes it even simpler than the original method to set up a prediction model. Furthermore, it is shown how this method can be extended not only to the case where several response variables are to be predicted from a set of predictors, but also to the multiblock setting where the aim is to predict one or several data sets from several other data sets. The value of the method is illustrated on a data set pertaining to epidemiology.

6.
The clusterwise regression model is used to perform cluster analysis within a regression framework. While the traditional regression model assumes the regression coefficient (β) to be identical for all subjects in the sample, the clusterwise regression model allows β to vary across clusters of subjects. Since the cluster membership is unknown, the estimation of the clusterwise regression is a hard combinatorial optimization problem. In this research, we propose a "Generalized Clusterwise Regression Model" which is formulated as a mathematical programming (MP) problem. A nonlinear programming procedure (with linear constraints) is proposed to solve the combinatorial problem and to estimate the cluster membership and β simultaneously. Moreover, by integrating the cluster analysis with the discriminant analysis, a clusterwise discriminant model is developed to incorporate parameter heterogeneity into the traditional discriminant analysis. The cluster membership and discriminant parameters are estimated simultaneously by another nonlinear programming model.

7.
This paper presents an extension of the standard regression tree method to clustered data. Previous works extending tree methods to accommodate correlated data are mainly based on the multivariate repeated-measures approach. We propose a "mixed effects regression tree" method where the correlated observations are viewed as nested within clusters rather than as vectors of multivariate repeated responses. The proposed method can handle unbalanced clusters, allows observations within clusters to be split, and can incorporate random effects and observation-level covariates. We implemented the proposed method using a standard tree algorithm within the framework of the expectation-maximization (EM) algorithm. The simulation results show that the proposed regression tree method provides substantial improvements over standard trees when the random effects are non-negligible. A real data example is used to illustrate the method.
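A minimal sketch of the idea, assuming a random-intercept model with a known variance ratio and using a generic CART learner in place of the paper's implementation, might look like this; the helper names and the shrinkage step are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mixed_effects_tree(X, y, groups, n_iter=10, var_ratio=1.0):
    """EM-flavoured sketch of a random-intercept regression tree:
    E-step: estimate the group intercepts as shrunken means of the residuals;
    M-step: refit the tree on y minus the current intercepts.
    var_ratio stands in for sigma2_error / sigma2_intercept (assumed known)."""
    groups = np.asarray(groups)
    ids = np.unique(groups)
    b = {g: 0.0 for g in ids}                          # random intercept per group
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    for _ in range(n_iter):
        y_adj = y - np.array([b[g] for g in groups])
        tree.fit(X, y_adj)                             # fixed (tree) part of the model
        resid = y - tree.predict(X)
        for g in ids:
            r = resid[groups == g]
            b[g] = r.sum() / (len(r) + var_ratio)      # shrinkage toward zero
    return tree, b
```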

8.
We extend the least angle regression algorithm using the information geometry of dually flat spaces. The extended least angle regression algorithm is used for estimating parameters in generalized linear regression, and it can also be used for selecting explanatory variables. We use the fact that a model manifold of an exponential family is a dually flat space. In estimating parameters, curves corresponding to bisectors in the Euclidean space play an important role. Originally, the least angle regression algorithm was proposed for estimating parameters and selecting explanatory variables in linear regression. It is an efficient algorithm in the sense that the number of iterations is the same as the number of explanatory variables. We extend the algorithm while keeping this efficiency. However, the extended least angle regression algorithm differs significantly from the original algorithm: it removes one explanatory variable in each iteration, whereas the original algorithm adds one explanatory variable in each iteration. We show results of the extended least angle regression algorithm for two types of datasets, illustrating its behavior; in particular, parameter estimates become smaller and smaller and vanish one after another.
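For reference, the behaviour of the original least angle regression algorithm that the paper extends can be reproduced with scikit-learn's `lars_path` on synthetic data; this shows only the classical linear-regression case, not the dually flat extension.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Coefficient path of the original LARS: with p explanatory variables the
# path is built in at most p steps, the efficiency the extension preserves.
X, y = make_regression(n_samples=100, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)
alphas, active, coefs = lars_path(X, y, method="lar")
print("order in which variables enter:", active)
print("coefficient path shape (features x steps):", coefs.shape)
```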

9.
Biased regression is an alternative to ordinary least squares (OLS) regression, especially when explanatory variables are highly correlated. In this paper, we examine the geometrical structure of the shrinkage factors of biased estimators. We show that, in most cases, shrinkage factors cannot belong to [0,1] in all directions. We also compare the shrinkage factors of ridge regression (RR), principal component regression (PCR) and partial least-squares regression (PLSR) in the orthogonal directions obtained by the signal-to-noise ratio (SNR) algorithm. In these directions, we find that PLSR and RR behave well, whereas the shrinkage factors of PCR behave erratically.
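As a point of comparison, the textbook shrinkage factors of ridge regression and principal component regression along the principal directions of a centred design matrix can be computed as below; the paper works instead with the directions produced by the SNR algorithm, which are not reproduced here.

```python
import numpy as np

def shrinkage_factors(X, lam=1.0, n_pcr_components=2):
    """Shrinkage factors along the principal directions of a centred design
    matrix: ridge shrinks the i-th direction by d_i^2 / (d_i^2 + lam), while
    PCR keeps a direction entirely (factor 1) or discards it (factor 0)."""
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)            # singular values, descending
    ridge = d**2 / (d**2 + lam)
    pcr = (np.arange(len(d)) < n_pcr_components).astype(float)
    return ridge, pcr

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])
print(shrinkage_factors(X, lam=1.0))
```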

10.
A data analysis method is proposed to derive a latent structure matrix from a sample covariance matrix. The matrix can be used to explore the linear latent effect between two sets of observed variables. Procedures are also proposed with which to estimate a set of dependent variables from a set of explanatory variables using the latent structure matrix. The proposed method can assist researchers in improving the effectiveness of SEM models by exploring the latent structure between two sets of variables. In addition, a structure residual matrix can be derived as a by-product of the proposed method, with which researchers can conduct experimental procedures for variable combinations and selections to build various models for hypothesis testing. These capabilities can improve the effectiveness of traditional SEM methods in data property characterization and model hypothesis testing. Case studies are provided to demonstrate the procedure of deriving the latent structure matrix step by step, and the latent structure estimation results are quite close to those of PLS regression. A structure coefficient index is suggested to explore the relationships among various combinations of variables and their effects on the variance of the latent structure.

11.
Fixed point clustering is a new stochastic approach to cluster analysis. The definition of a single fixed point cluster (FPC) is based on a simple parametric model, but there is no parametric assumption for the whole dataset as opposed to mixture modeling and other approaches. An FPC is defined as a data subset that is exactly the set of non-outliers with respect to its own parameter estimators. This paper concentrates upon the theoretical foundation of FPC analysis as a method for clusterwise linear regression, i.e., the single clusters are modeled as linear regressions with normal errors. In this setup, fixed point clustering is based on an iteratively reweighted estimation with zero weight for all outliers. FPCs are non-hierarchical, but they may overlap and include each other. A specification of the number of clusters is not needed. Consistency results are given for certain mixture models of interest in cluster analysis. Convergence of a fixed point algorithm is shown. Application to a real dataset shows that fixed point clustering can highlight some other interesting features of datasets compared to maximum likelihood methods in the presence of deviations from the usual assumptions of model based cluster analysis.
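The defining fixed-point condition can be illustrated with a small sketch for the linear regression case: refit on the current subset, declare non-outliers by a residual threshold, and stop when the subset reproduces itself. The threshold constant `c` and the starting subset are illustrative assumptions, not the tuning proposed in the paper.

```python
import numpy as np

def fixed_point_cluster(X, y, start_idx, c=2.5, max_iter=100):
    """Sketch of one fixed point cluster search for linear regression: fit OLS
    on the current subset, call a point a non-outlier if its absolute residual
    is within c * sigma of that fit, and iterate until the subset reproduces
    itself (a fixed point of the reweighting map)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    member = np.zeros(n, dtype=bool)
    member[np.asarray(start_idx)] = True
    beta = None
    for _ in range(max_iter):
        beta, *_ = np.linalg.lstsq(Xd[member], y[member], rcond=None)
        resid = y - Xd @ beta
        sigma = np.sqrt(np.mean(resid[member] ** 2))
        new_member = np.abs(resid) <= c * sigma
        if np.array_equal(new_member, member):          # fixed point reached
            break
        member = new_member
    return member, beta
```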

12.
In hybrid joint probability density function (joint PDF) algorithms for turbulent reactive flows, the equations for the mean flow, discretized with a classical grid-based method (e.g. a finite volume method (FVM)), are solved together with a Monte Carlo (particle) method for the joint velocity-composition PDF. When such methods are applied to complex geometries, the solution strategy, which aims at obtaining a converged solution of the coupled problem on a sufficiently fine grid, becomes very important. This paper describes one important aspect of this solution strategy, namely multigrid computing, which is well known to be very efficient for computing numerical solutions on fine grids. Two sets of grid-based variables are involved: cell-centered variables from the FVM and node-centered variables, which denote the moments of the PDF extracted from the particle fields. Starting from a given multiblock grid environment, a new (refined or coarsened) grid is first defined while retaining the grid quality. Projection and prolongation operators are defined for the two sets of variables. In this new grid environment the particles are redistributed. The effectiveness of the multigrid algorithm is demonstrated: compared to solely solving on the finest grid, convergence can be reached about one order of magnitude faster when using the multigrid algorithm in three stages. The computation time used for projection or prolongation is negligible.
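As a generic illustration of projection (restriction) and prolongation, the sketch below gives standard one-dimensional node-centred operators: full weighting down to the coarse grid and linear interpolation back. The paper's operators act on three-dimensional multiblock grids with both cell- and node-centred variables, so this is only an assumed simplification of the idea.

```python
import numpy as np

def restrict(fine):
    """1D full-weighting restriction (node-centred): coarse node j coincides
    with fine node 2j; interior values get 1/4, 1/2, 1/4 weighted averages."""
    coarse = fine[::2].copy()                          # injection at coinciding nodes
    coarse[1:-1] = 0.25 * fine[1:-2:2] + 0.5 * fine[2:-1:2] + 0.25 * fine[3::2]
    return coarse

def prolong(coarse):
    """1D prolongation by linear interpolation back to the fine grid."""
    fine = np.empty(2 * len(coarse) - 1)
    fine[::2] = coarse                                 # copy at coinciding nodes
    fine[1::2] = 0.5 * (coarse[:-1] + coarse[1:])      # interpolate in between
    return fine

v_fine = np.sin(np.linspace(0.0, np.pi, 9))            # 9 fine nodes -> 5 coarse nodes
print(restrict(v_fine).shape, prolong(restrict(v_fine)).shape)
```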

13.
Fixed effects models are very flexible because they do not make assumptions on the distribution of effects and can also be used if the heterogeneity component is correlated with explanatory variables. A disadvantage is the large number of effects that have to be estimated. A recursive partitioning (or tree-based) method is proposed that identifies clusters of units that share the same effect. The approach reduces the number of parameters to be estimated and is useful in particular if one is interested in identifying clusters with the same effect on a response variable. It is shown that the method performs well and outperforms competitors such as the finite mixture model, in particular if the heterogeneity component is correlated with explanatory variables. In two applications the usefulness of the approach to identify clusters that share the same effect is illustrated. Supplementary materials for this article are available online.

14.
In this paper we handle the general problem of finding q (> 1) central relations on a set of objects which best fit the information contained in a finite number of given relations on that set. The proposed CAR (clusterwise aggregation of relations) algorithm includes the well-known situation of determining a single central relation as a special case (q = 1) and takes into account the fact that representing appropriately selected subsets of relations by different central relations can provide additional insight into whether different clusters or segments of relations exist in the given set of relations. Two examples demonstrate the usefulness of the suggested approach.
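A much-simplified sketch of the clusterwise aggregation idea, not the CAR algorithm itself, treats relations as 0/1 adjacency matrices, measures fit by the symmetric difference, and alternates between assigning relations to their nearest central relation and updating each central relation by an elementwise majority rule (which minimises the within-cluster distance sum for this metric).

```python
import numpy as np

def clusterwise_central_relations(relations, q=2, n_iter=50, seed=0):
    """Simplified illustration of clusterwise aggregation: relations are 0/1
    adjacency matrices, distance is the symmetric difference, and each
    cluster's central relation is the elementwise majority relation."""
    rng = np.random.default_rng(seed)
    R = np.stack(relations).astype(float)              # shape (m, n, n)
    m = len(R)
    labels = rng.integers(q, size=m)
    for _ in range(n_iter):
        centers = np.stack([
            (R[labels == c].mean(axis=0) >= 0.5).astype(float)
            if np.any(labels == c) else R[rng.integers(m)]
            for c in range(q)
        ])
        # symmetric-difference (Hamming) distance of each relation to each center
        dists = np.array([[np.abs(r - c).sum() for c in centers] for r in R])
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return centers, labels
```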

15.
A cluster-based method for constructing sparse principal components is proposed. The method initially forms clusters of variables, using a new clustering approach called the semi-partition, in two steps. First, the variables are ordered sequentially according to a criterion involving the correlations between variables. Then, the ordered variables are split into two parts based on their generalized variance. The first group of variables becomes an output cluster, while the second becomes the input for another run of the sequential process. After the optimal clusters have been formed, sparse components are constructed from the singular value decomposition of the data matrices of each cluster. The method is applied to simple data sets with a smaller number of variables (p) than observations (n), as well as large gene expression data sets with p ≫ n. The resulting cluster-based sparse principal components are very promising as evaluated by objective criteria. The method is also compared with other existing approaches and is found to perform well.
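The overall construction can be sketched as follows, with ordinary hierarchical clustering of variables on a correlation-based distance standing in for the paper's semi-partition step; once the variable groups are fixed, the leading right singular vector of each group's columns gives a loading vector that is exactly zero outside the group.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_based_sparse_pcs(X, n_clusters=2):
    """Sketch of the overall construction, with hierarchical clustering of
    the variables on 1 - |correlation| standing in for the semi-partition:
    group the variables, then take the leading right singular vector of each
    group's (centred) columns; loadings outside the group are exactly zero."""
    Xc = X - X.mean(axis=0)
    dist = 1.0 - np.abs(np.corrcoef(Xc, rowvar=False))
    Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")
    groups = fcluster(Z, t=n_clusters, criterion="maxclust")
    p = X.shape[1]
    components = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        _, _, Vt = np.linalg.svd(Xc[:, idx], full_matrices=False)
        v = np.zeros(p)
        v[idx] = Vt[0]                                 # sparse loading vector
        components.append(v)
    return np.array(components)
```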

16.
For several years, model-based clustering methods have successfully tackled many of the challenges presented to data analysts. However, as the scope of data analysis has evolved, some problems may be beyond the standard mixture model framework. One such problem is when observations in a dataset come from overlapping clusters, whereby different clusters will possess similar parameters for multiple variables. In this setting, mixed membership models, a soft clustering approach whereby observations are not restricted to single cluster membership, have proved to be an effective tool. In this paper, a method for fitting mixed membership models to data generated by a member of an exponential family is outlined. The method is applied to count data obtained from an ultra-running competition, and compared with a standard mixture model approach.

17.
On the basis of two data sets containing Loss Given Default (LGD) observations of home equity and corporate loans, we consider non-linear and non-parametric techniques to model and forecast LGD. These techniques include non-linear Support Vector Regression (SVR), a regression tree, a transformed linear model and a two-stage model combining a linear regression with SVR. We compare these models with an ordinary least squares linear regression. In addition, we incorporate several variants of 11 macroeconomic indicators to estimate the influence of the economic state on loan losses. The out-of-time set-up is complemented with an out-of-sample set-up to mitigate the limited number of credit crisis observations available in credit risk data sets. The two-stage/transformed model outperforms the other techniques when forecasting out-of-time for the home equity/corporate data set, while the non-parametric regression tree is the best performer when forecasting out-of-sample. The incorporation of macroeconomic variables significantly improves the prediction performance. The downturn impact ranges up to 5% depending on the data set and the macroeconomic conditions defining the downturn. These conclusions can help financial institutions when estimating LGD under the internal ratings-based approach of the Basel Accords in order to estimate the downturn LGD needed to calculate the capital requirements. Banks are also required as part of stress test exercises to assess the impact of stressed macroeconomic scenarios on their Profit and Loss (P&L) and banking book, which favours the accurate identification of relevant macroeconomic variables driving LGD evolutions.
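A minimal sketch of the two-stage idea (ordinary least squares for the linear part of LGD, SVR on the residuals for the non-linear part) is shown below using scikit-learn; the class name and hyperparameters are illustrative, and the transformation step of the transformed linear model is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

class TwoStageLGD:
    """Sketch of the two-stage idea: an OLS model captures the linear signal
    in LGD, and an SVR is trained on its residuals to pick up the remaining
    non-linear part; predictions add the two stages together."""
    def __init__(self, **svr_kwargs):
        self.linear = LinearRegression()
        self.svr = SVR(**svr_kwargs)

    def fit(self, X, y):
        self.linear.fit(X, y)
        self.svr.fit(X, y - self.linear.predict(X))    # stage 2 on the residuals
        return self

    def predict(self, X):
        return self.linear.predict(X) + self.svr.predict(X)
```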

18.
The maximum independent set problem is NP-hard and particularly difficult to solve in sparse graphs, which typically take exponential time to solve exactly using the best-known exact algorithms. In this paper, we present two novel heuristic algorithms for computing large independent sets on huge sparse graphs that are intractable in practice. First, we develop an advanced evolutionary algorithm that uses fast graph partitioning with local search algorithms to implement efficient combine operations that exchange whole blocks of given independent sets. Though the evolutionary algorithm itself is highly competitive with existing heuristic algorithms on large social networks, we further show that it can be effectively used as an oracle to guess vertices that are likely to be in large independent sets. We then show how to combine these guesses with kernelization techniques in a branch-and-reduce-like algorithm to compute high-quality independent sets quickly in huge complex networks. Our experiments against a recent (and fast) exact algorithm for large sparse graphs show that our technique always computes an optimal solution when the exact solution is known, and it further computes consistent results on even larger instances where the solution is unknown. Ultimately, we show that identifying and removing vertices likely to be in large independent sets opens up the reduction space, which not only speeds up the computation of large independent sets drastically, but also enables us to compute high-quality independent sets on much larger instances than previously reported in the literature.
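Two of the classic reduction rules behind kernelization, followed by a simple minimum-degree greedy pass on the remaining kernel, can be sketched with networkx as below; this is a toy illustration of the reduce-then-solve principle, not the evolutionary or branch-and-reduce machinery of the paper.

```python
import networkx as nx

def reduce_and_greedy_mis(G):
    """Sketch of the reduce-then-solve idea: apply two classic reduction rules
    exhaustively (an isolated vertex always joins the set; a degree-1 vertex
    is at least as good a choice as its neighbour), then finish the remaining
    kernel with a minimum-degree greedy heuristic."""
    G = G.copy()
    solution = set()
    changed = True
    while changed:
        changed = False
        for v in list(G.nodes):
            if v not in G:
                continue
            deg = G.degree(v)
            if deg == 0:                               # isolated vertex rule
                solution.add(v)
                G.remove_node(v)
                changed = True
            elif deg == 1:                             # pendant vertex rule
                u = next(iter(G.neighbors(v)))
                solution.add(v)
                G.remove_nodes_from([v, u])
                changed = True
    while G.number_of_nodes() > 0:                     # greedy on the kernel
        v = min(G.nodes, key=G.degree)
        solution.add(v)
        G.remove_nodes_from([v] + list(G.neighbors(v)))
    return solution
```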

19.
Factor clustering methods have been developed in recent years thanks to improvements in computational power. These methods perform a linear transformation of the data and a clustering of the transformed data, optimizing a common criterion. Probabilistic distance (PD)-clustering is an iterative, distribution-free, probabilistic clustering method. Factor PD-clustering (FPDC) is based on PD-clustering and involves a linear transformation of the original variables into a reduced number of orthogonal ones using a common criterion with PD-clustering. This paper demonstrates that Tucker3 decomposition can be used to accomplish this transformation. Factor PD-clustering alternately applies Tucker3 decomposition and PD-clustering to the transformed data until convergence is achieved. This method can significantly improve the performance of the PD-clustering algorithm; large data sets can thus be partitioned into clusters with increasing stability and robustness of the results. Real and simulated data sets are used to compare FPDC with its main competitors: it performs equally well when clusters are elliptically shaped but outperforms its competitors with non-Gaussian-shaped clusters or noisy data.

20.
Ridge regression is an important approach in linear regression when explanatory variables are highly correlated. Although expressions for the estimators of ridge regression parameters have been successfully obtained via matrix operations after the observed data are standardized, they cannot be applied to big data since it is impossible to load the entire data set into the memory of a single computer and it is hard to standardize the original observed data. To overcome these difficulties, the present article proposes new methods and algorithms. The basic idea is to compute a matrix of sufficient statistics by rows. Once the matrix is derived, it is not necessary to use the original data again. Since the entire data set is only scanned once, the proposed methods and algorithms can be extremely efficient in the computation of estimates of ridge regression parameters. It is expected that the basic knowledge gained in this article will have a great impact on statistical approaches to big data.
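A sketch of the one-pass idea is straightforward: accumulate X'X, X'y, the column sums and the row count chunk by chunk, then standardize and solve the ridge system from these sufficient statistics alone. The chunk interface and parameter names below are assumptions for illustration, not the article's algorithms.

```python
import numpy as np

def streaming_ridge(chunks, p, lam=1.0):
    """One-pass ridge regression: accumulate the sufficient statistics X'X,
    X'y, the column sums and the row count chunk by chunk, then standardize
    and solve (X'X + lam*I) beta = X'y without ever holding the full data
    set in memory."""
    XtX = np.zeros((p, p)); Xty = np.zeros(p)
    sx = np.zeros(p); sy = 0.0; n = 0
    for X, y in chunks:                                # each chunk fits in memory
        XtX += X.T @ X
        Xty += X.T @ y
        sx += X.sum(axis=0); sy += y.sum(); n += len(y)
    mx, my = sx / n, sy / n
    # centred cross-products recovered from the raw sufficient statistics
    Sxx = XtX - n * np.outer(mx, mx)
    Sxy = Xty - n * mx * my
    scale = np.sqrt(np.diag(Sxx))
    Sxx_std = Sxx / np.outer(scale, scale)             # correlation-scale X'X
    Sxy_std = Sxy / scale
    beta_std = np.linalg.solve(Sxx_std + lam * np.eye(p), Sxy_std)
    beta = beta_std / scale                            # back to the original scale
    intercept = my - mx @ beta
    return beta, intercept
```

For example, `chunks` can be any iterator yielding `(X, y)` blocks read from disk, so the memory footprint is bounded by the chunk size plus the p-by-p statistics.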
