Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Irregularities are widespread in large databases and often lead to erroneous conclusions in data mining and statistical analysis. For example, considerable bias often results from parameter estimation procedures that do not properly handle significant irregularities. Most data cleaning tools assume one known type of irregularity. This paper proposes a generic Irregularity Enlightenment (IE) framework for dealing with the situation in which multiple irregularities are hidden in large volumes of data in general and in cross-sectional time series in particular. It develops an automatic data mining platform to capture key irregularities and classify them based on their importance in a database. By decomposing time series data into basic components, we propose to optimize a penalized least squares loss function to aid the selection of key irregularities in consecutive steps and to cluster time series into different groups until an acceptable level of variation reduction is achieved. Finally, visualization tools are developed to help analysts interpret and understand the nature of the data better and faster before further data modeling and analysis.

2.
In this paper we provide evidence of the benefits of an approach which combines data mining and mathematical programming to determine the premium to charge automobile insurance policy holders in order to arrive at an optimal portfolio. A non-linear integer programming formulation is proposed to determine optimal premiums based on the insurer's need to find a balance between profitability and market share. The non-linear integer programming approach to solving this problem is used within a data mining framework which consists of three components: classifying policy holders into homogeneous risk groups and predicting the claim cost of each group using k-means clustering; determining the price sensitivity (propensity to pay) of each group using neural networks; and combining the results of the first two components to determine the optimal premium to charge. We have earlier presented the results of the first two components. In this paper we present the results of the third component. Using our approach, we have been able to increase revenue without affecting termination rates and market share.

3.
We present a general framework for studying harmonic analysis of functions in the settings of various emerging problems in the theory of diffusion geometry. The starting point of the now classical diffusion geometry approach is the construction of a kernel whose discretization leads to an undirected graph structure on an unstructured data set. We study the question of constructing such kernels for directed graph structures, and argue that our construction is essentially the only way to do so using discretizations of kernels. We then use our previous theory to develop harmonic analysis based on the singular value decomposition of the resulting non-self-adjoint operators associated with the directed graph. Next, we consider the question of how functions defined on one space evolve to another space in the paradigm of changing data sets recently introduced by Coifman and Hirn. While the approach of Coifman and Hirn requires that the points on one space should be in a known one-to-one correspondence with the points on the other, our approach allows the identification of only a subset of landmark points. We introduce a new definition of distance between points on two spaces, construct localized kernels based on the two spaces and certain interaction parameters, and study the evolution of smoothness of a function on one space to its lifting to the other space via the landmarks. We develop novel mathematical tools that enable us to study these seemingly different problems in a unified manner.

4.
Data preprocessing is a critical step in the data mining process and has a huge impact on the success of a data mining project. In this paper, we present an algorithm, DB-HReduction, that discretizes or eliminates numeric attributes and generalizes or eliminates symbolic attributes efficiently and effectively. The algorithm greatly reduces the number of attributes and tuples in the data set, improving the accuracy and reducing the running time of the data mining algorithms applied in later stages.
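To make the idea of discretizing a numeric attribute concrete, here is a minimal sketch of plain equal-width binning. It is an illustrative assumption, not the DB-HReduction algorithm, whose details the abstract does not give; the function name `equal_width_bins` and the sample `ages` data are hypothetical.

```python
from typing import List


def equal_width_bins(values: List[float], n_bins: int) -> List[int]:
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; everything falls in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute (ages) reduced to three symbolic levels.
ages = [18, 22, 25, 31, 40, 47, 58, 63]
print(equal_width_bins(ages, 3))  # → [0, 0, 0, 0, 1, 1, 2, 2]
```

After binning, identical tuples can be merged, which is one simple way a discretization step can shrink a data set before mining.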

5.
We consider the problem of discrete time filtering (intermittent data assimilation) for differential equation models and discuss methods for its numerical approximation. The focus is on methods based on ensemble/particle techniques and on the ensemble Kalman filter technique in particular. We summarize as well as extend recent work on continuous ensemble Kalman filter formulations, which provide a concise dynamical systems formulation of the combined dynamics-assimilation problem. Possible extensions to fully nonlinear ensemble/particle based filters are also outlined using the framework of optimal transportation theory.
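As a rough illustration of what an ensemble Kalman filter does at an observation time, the sketch below implements one analysis step of the standard perturbed-observation variant for a scalar state that is observed directly. This is the textbook discrete update, not the continuous formulation the abstract discusses; the function name `enkf_analysis` and all numbers are assumptions chosen for illustration.

```python
import random
from statistics import variance
from typing import List


def enkf_analysis(ensemble: List[float], perturbed_obs: List[float], obs_var: float) -> List[float]:
    """One perturbed-observation EnKF analysis step for a directly observed scalar state."""
    if len(ensemble) != len(perturbed_obs):
        raise ValueError("need one perturbed observation per ensemble member")
    forecast_var = variance(ensemble)  # sample variance (n - 1 denominator)
    gain = forecast_var / (forecast_var + obs_var)  # Kalman gain with observation operator H = 1
    # Each member is nudged toward its own perturbed copy of the observation.
    return [x + gain * (y - x) for x, y in zip(ensemble, perturbed_obs)]


# Hypothetical forecast ensemble and a single observation y = 3.0 with noise variance 1.0.
rng = random.Random(0)
forecast = [0.0, 1.0, 2.0]
obs_var = 1.0
perturbed = [3.0 + rng.gauss(0.0, obs_var ** 0.5) for _ in forecast]
print(enkf_analysis(forecast, perturbed, obs_var))
```

With forecast variance 1.0 and observation variance 1.0 the gain is 0.5, so each member moves halfway toward its perturbed observation.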

6.
Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational, or otherwise empirical, domain of interest. “Structure” has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Symmetries directly point to invariants that pinpoint intrinsic properties of the data and of the background empirical domain of interest. As our data models change, so too do our perspectives on analyzing data. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.

7.
Summary: An increasingly important problem in exploratory data analysis and visualization is that of scale; more and more data sets are much too large to analyze using traditional techniques, either in terms of the number of variables or the number of records. One approach to addressing this problem is the development and use of multiresolution strategies, where we represent the data at different levels of abstraction or detail through aggregation and summarization. In this paper we present an overview of our recent and current activities in the development of a multiresolution exploratory visualization environment for large-scale multivariate data. We have developed visualization, interaction, and data management techniques for effectively dealing with data sets that contain millions of records and/or hundreds of dimensions, and propose methods for applying similar approaches to extend the system to handle nominal as well as ordinal data.

8.
Sustainable product design is considered one of the most important practices for achieving sustainability. To improve the environmental performance of a product through product design, however, a firm often needs to deal with difficult technical trade-offs between traditional and environmental attributes, which require new design concepts and engineering specifications. In this paper, we propose a novel use of two-stage network Data Envelopment Analysis (DEA) to evaluate sustainable product design performance. We conceptualize “design efficiency” as a key measurement of design performance in terms of how well multiple product specifications and attributes are combined in a product design that leads to lower environmental impacts or better environmental performance. A two-stage network DEA model is developed for sustainable design performance evaluation with an “industrial design module” and a “bio design module.” To demonstrate the applications of our DEA-based methodology, we use data on key engineering specifications, product attributes, and emissions performance in the vehicle emissions testing database published by the US EPA to evaluate the sustainable design performance of different automobile manufacturers. Our test results show that sustainable design does not need to mean compromise between traditional and environmental attributes. By addressing the interrelatedness of subsystems in product design, a firm can find the most efficient way to combine product specifications and attributes, leading to lower environmental impacts or better environmental performance. This paper contributes to the existing literature by developing a new research framework for evaluating sustainable design performance and by proposing an innovative application of two-stage network DEA for finding the most eco-efficient way to achieve better environmental performance through product design.

9.
Sex determination mainly encompasses two mechanisms: genotypic sex determination (GSD) and temperature-dependent sex determination (TSD). Genotypic sex determination operates through the presence of sex chromosomes. In many reptiles, sex determination is greatly influenced by environmental conditions such as the temperature of the nest and the weight and size of the eggs. A nature-inspired algorithm that mimics the mechanism of temperature-dependent sex determination (TSD) is introduced for mining classification rules from datasets. The proposed TSD algorithm is compared with other well-known rule induction algorithms, such as PRISM, C4.5, 1-R, CN2, and NN, on several benchmark datasets.

10.
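Early-warning analysis of financial distress in listed companies: a study based on data mining   Total citations: 3, self-citations: 0, citations by others: 3
刘旻 (Liu Min), 罗慧 (Luo Hui). 《数理统计与管理》, 2004, 23(3): 51-56, 68
This paper studies listed companies in China. The training sample consists of 73 companies that were designated ST (special treatment) during 1999-2001 and 73 normal companies; the test sample consists of 43 companies designated ST in 2002 and 43 normal companies. We analyze 15 financial indicators of both groups for each of the two years preceding the onset of financial distress. For the data mining, we apply three independent methods: discriminant analysis, logistic regression, and neural networks. The neural network predicts better than the other two methods. Finally, we combine the strengths of these methods into a hybrid model, whose prediction accuracy is higher than that of each individual method, thereby improving the model's early-warning performance.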

11.
Business networking for the purpose of becoming globally more competitive seems to form the very basis of strategic decisions in many companies today. The concept of “network company” has recently been the subject of many studies in the literature, perhaps mostly due to its worldwide practice among more successful companies. Yet, there is no model-based formal treatment of the concept per se leading to the development of frameworks that are instrumental in formulating networking strategies. This paper formalizes the concept of “network company” within the context of global competition. For this purpose, “network company” is positioned in the value chain of pertinent product–market chain systems and then its functioning is decomposed into a set of minimal and basic components, which are termed “elementary resources, methods, products, and activities”. The set thus defined at that level of detail is used to analyze and evaluate “network companies” at any desired condensed level reflecting the needs of a project or a function for the purpose of competitive strategy formulation. The formal analytical framework developed is then discussed in association with three basic approaches to competitive strategy formulation: resource-based strategy, activity-based strategy, and strategy based on the economic theory of the firm. The usefulness of the proposed framework in connection with these approaches is expressed in terms of formal propositions.

12.
Denoising analysis poses new challenges for mining high-frequency financial data because of the irregularities and roughness of such data. Inefficient decomposition of high-frequency data into the systematic pattern (the trend) and noise will lead to erroneous conclusions, as the irregularities and roughness of the data make traditional methods difficult to apply. In this paper, we propose the local linear scaling approximation (LLSA) algorithm, a new nonlinear filtering algorithm based on the linear maximal overlap discrete wavelet transform (MODWT), to separate the systematic pattern from the noise. We show several properties of this new algorithm, namely local linearity, computational complexity, and consistency. We conduct a simulation study to confirm the properties we have shown analytically and to compare the performance of LLSA with that of MODWT. We then apply the new algorithm to real high-frequency data from the German equity market to investigate its use in forecasting. We show the superior performance of LLSA and conclude that it can be applied with flexible settings and is suitable for high-frequency data mining.
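For readers unfamiliar with the MODWT that LLSA builds on, the following is a minimal sketch of a single level of the transform with the Haar filter and circular boundary handling. It shows only the linear MODWT step, not the LLSA algorithm itself, which the abstract does not specify; the function name `haar_modwt_level1` and the `prices` series are assumptions for illustration.

```python
from typing import List, Tuple


def haar_modwt_level1(x: List[float]) -> Tuple[List[float], List[float]]:
    """Level-1 Haar MODWT: return (scaling, wavelet) coefficients, each as long as x."""
    n = len(x)
    # x[t - 1] at t = 0 wraps to x[-1], which gives circular boundary handling.
    scaling = [(x[t] + x[t - 1]) / 2 for t in range(n)]  # local average (trend)
    wavelet = [(x[t] - x[t - 1]) / 2 for t in range(n)]  # local difference (detail)
    return scaling, wavelet


# Hypothetical short price series.
prices = [10.0, 12.0, 11.0, 15.0]
scaling, wavelet = haar_modwt_level1(prices)
print(scaling)  # → [12.5, 11.0, 11.5, 13.0]
print(wavelet)  # → [-2.5, 1.0, -0.5, 2.0]
# For the Haar filter at level 1, the two coefficient series add back to the input.
print([s + w for s, w in zip(scaling, wavelet)])  # → [10.0, 12.0, 11.0, 15.0]
```

Unlike the ordinary discrete wavelet transform, the MODWT does not downsample, so both output series keep the length of the input and stay aligned with the original time stamps.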

13.
Advances in Data Analysis and Classification - A growing number of problems in data analysis and classification involve data that are non-Euclidean. For such problems, a naive application of vector...

14.
In this note, we address the problem of surrogacy using a causal modelling framework that differs substantially from the potential outcomes model that pervades the biostatistical literature. The framework comes from econometrics and conceptualizes direct effects of the surrogate endpoint on the true endpoint. While this framework can incorporate the so-called semi-competing risks data structure, we also derive a fundamental non-identifiability result. Relationships to existing causal modelling frameworks are also discussed.

15.
A basic premise in the development of yield management has been that the differentiated fare products offered by airlines are targeted to distinct segments of the total demand for air travel in a market, each of which competes for space on a fixed capacity aircraft. Such representations of differential pricing assume that the airline can segment its demand perfectly and without cost to consumers, and further, ignore the dependence of the demand for a given fare product on the price levels and characteristics of the other available fare products. In this paper, we introduce a new ‘generalised cost’ model of fare product differentiation that incorporates the relationships between available airline fare products as well as the cost incurred by consumers of accepting more restrictions. We extend the model to incorporate the diversion of passengers to lower-priced fare products as a result of their ability to meet the additional restrictions imposed by airlines, and then demonstrate how seat inventory control can be used to induce diverting passengers to ‘sell up’ to higher-priced fare products by applying booking limits. An example of how the model can be used for joint fare product price level optimisation is presented, along with a numerical example that illustrates the extent to which the conventional model of price discrimination over-estimates passenger demand and, in turn, total airline revenues.

16.
This study presents a data mining analysis of forecasting patterns in a supply chain. Multiple customers who are auto manufacturers order from a large auto parts supplier. The auto manufacturers provide forecasts for future orders and update them before the due date. The supplier uses these forecasts to plan production in advance. The accuracy of the forecasts varies from customer to customer. We provide a framework to analyze the forecast performance of the customers. There are different complexities in forecasts that are captured from our data set. Daily flow analysis helps to transform data and obtain accuracy ratios of forecasts. Customers are then classified based on their forecast performances. We demonstrate the application of some recent developments in clustering and pattern recognition analysis to performance analysis of customers.

17.
Data mining is generally defined as the science of nontrivial extraction of implicit, previously unknown, and potentially useful information from datasets. Many websites provide extensive information about products and allow users to post comments on them and rate them on a scale of 1 to 5. During the past decade, the need for intelligent algorithms for calculating and organizing extremely large sets of data has grown exponentially. In this article we investigate the extent to which a product’s average user rating can be predicted using a manageable subset of a data set. For this we use a prediction model based on a linearization algorithm and sketch how an inverse problem can be formulated to yield a smooth local volatility function of user ratings. The MAPLE programs that implement the proposed algorithm show that the method is reasonably accurate for reconstructing the volatility of user ratings, which is useful both for accurate user predictions and for computing sensitivity.

18.
19.

In this paper, we consider sequential decision problems with uncertainty, represented as decision trees. Sensitivity analysis is always a crucial element of decision making, and in decision trees it often focuses on probabilities. In the stochastic model considered, the user often has only limited information about the true values of probabilities. We develop a framework for performing sensitivity analysis of optimal strategies accounting for this distributional uncertainty. We design this robust optimization approach in an intuitive and not overly technical way, to make it simple to apply in daily managerial practice. The proposed framework allows for (1) analysis of the stability of the expected-value-maximizing strategy and (2) identification of strategies which are robust with respect to pessimistic/optimistic/mode-favoring perturbations of probabilities. We verify the properties of our approach in two cases: (a) probabilities in a tree are the primitives of the model and can be modified independently; (b) probabilities in a tree reflect some underlying, structural probabilities, and are interrelated. We provide a free software tool implementing the methods described.
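The stability question in point (1) can be illustrated with a very small sketch: roll back a decision tree to find the expected-value-maximizing choice, then check whether that choice survives when one probability is shifted by up to some amount. This is an illustrative toy, not the authors' framework or software tool; the tree, payoffs, and the names `expected_value`, `best_choice`, `build_tree`, and `stable_under` are all hypothetical.

```python
from typing import Any, Tuple

# A node is ("payoff", value), ("chance", [(prob, child), ...]) or ("decision", {name: child}).
Node = Tuple[str, Any]


def expected_value(node: Node) -> float:
    """Roll back the tree: average over chance nodes, maximize over decision nodes."""
    kind, body = node
    if kind == "payoff":
        return float(body)
    if kind == "chance":
        return sum(p * expected_value(child) for p, child in body)
    if kind == "decision":
        return max(expected_value(child) for child in body.values())
    raise ValueError(f"unknown node kind: {kind}")


def best_choice(decision: Node) -> str:
    """Name of the option with the highest expected value at a decision node."""
    kind, options = decision
    if kind != "decision":
        raise ValueError("best_choice expects a decision node")
    return max(options, key=lambda name: expected_value(options[name]))


def build_tree(p_success: float) -> Node:
    """Hypothetical problem: launch (risky) versus hold (safe payoff of 10)."""
    launch = ("chance", [(p_success, ("payoff", 100.0)), (1.0 - p_success, ("payoff", -40.0))])
    return ("decision", {"launch": launch, "hold": ("payoff", 10.0)})


def stable_under(p_success: float, delta: float) -> bool:
    """True if the best choice is unchanged when p_success shifts by up to delta.

    Expected value is linear in this single probability, so checking the two
    endpoints of the interval is enough.
    """
    base = best_choice(build_tree(p_success))
    lo, hi = max(0.0, p_success - delta), min(1.0, p_success + delta)
    return all(best_choice(build_tree(p)) == base for p in (lo, hi))


print(best_choice(build_tree(0.5)))  # → launch
print(stable_under(0.5, 0.1))        # → True
print(stable_under(0.5, 0.2))        # → False
```

In this toy problem the choice flips to "hold" once the success probability drops below 50/140, so a perturbation of 0.2 around 0.5 is enough to change the recommended strategy.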

20.
Supervised classification is an important part of corporate data mining to support decision making in customer-centric planning tasks. The paper proposes a hierarchical reference model for support vector machine based classification within this discipline. The approach balances the conflicting goals of transparent yet accurate models and compares favourably to alternative classifiers in a large-scale empirical evaluation in real-world customer relationship management applications. Recent advances in support vector machine oriented research are incorporated to approach feature, instance and model selection in a unified framework.
