期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Rule induction in data mining: effect of ordinal scales

Helen M. Moshkovich Alexander I. Mechitov David L. Olson 《Expert systems with applications》2002,22(4)

Many classification tasks can be viewed as ordinal. Use of numeric information usually provides possibilities for more powerful analysis than ordinal data. On the other hand, ordinal data allows more powerful analysis when compared to nominal data. It is therefore important not to overlook knowledge about ordinal dependencies in data sets used in data mining. This paper investigates data mining support available from ordinal data. The effect of considering ordinal dependencies in the data set on the overall results of constructing decision trees and induction rules is illustrated. The degree of improved prediction of ordinal over nominal data is demonstrated. When data was very representative and consistent, use of ordinal information reduced the number of final rules with a lower error rate. Data treatment alternatives are presented to deal with data sets having greater imperfections. 相似文献

2.

A hybrid data assimilation scheme for model parameter estimation: Application to morphodynamic modelling

Polly J. Smith Sarah L. Dance Nancy K. Nichols 《Computers & Fluids》2011,46(1):436-441

We present a novel algorithm for joint state-parameter estimation using sequential three dimensional variational data assimilation (3D Var) and demonstrate its application in the context of morphodynamic modelling using an idealised two parameter 1D sediment transport model. The new scheme combines a static representation of the state background error covariances with a flow dependent approximation of the state-parameter cross-covariances. For the case presented here, this involves calculating a local finite difference approximation of the gradient of the model with respect to the parameters. The new method is easy to implement and computationally inexpensive to run. Experimental results are positive with the scheme able to recover the model parameters to a high level of accuracy. We expect that there is potential for successful application of this new methodology to larger, more realistic models with more complex parameterisations. 相似文献

3.

Bayesian model determination for multivariate ordinal and binary data

Emily L. Webb 《Computational statistics & data analysis》2008,52(5):2632-2649

Different conditional independence specifications for ordinal categorical data are compared by calculating a posterior distribution over classes of graphical models. The approach is based on the multivariate ordinal probit model where the data are considered to have arisen as truncated multivariate normal random vectors. By parameterising the precision matrix of the associated multivariate normal in Cholesky form, ordinal data models corresponding to directed acyclic conditional independence graphs for the latent variables can be specified and conveniently computed. Where one or more of the variables are binary this parameterisation is particularly compelling, as necessary constraints on the latent variable distribution can be imposed in such a way that a standard, fully normalised, prior can still be adopted. For comparing different directed graphical models a reversible jump Markov chain Monte Carlo (MCMC) approach is proposed. Where interest is focussed on undirected graphical models, this approach is augmented to allow switches in the orderings of variables of associated directed graphs, hence allowing the posterior distribution over decomposable undirected graphical models to be computed. The approach is illustrated with several examples, involving both binary and ordinal variables, and directed and undirected graphical model classes. 相似文献

4.

A clipped latent variable model for spatially correlated ordered categorical data 总被引：2，自引：0，他引：2

Megan Dailey Higgs Jennifer A. Hoeting 《Computational statistics & data analysis》2010,54(8):1999-490

We propose a model for a point-referenced spatially correlated ordered categorical response and methodology for inference. Models and methods for spatially correlated continuous response data are widespread, but models for spatially correlated categorical data, and especially ordered multi-category data, are less developed. Bayesian models and methodology have been proposed for the analysis of independent and clustered ordered categorical data, and also for binary and count point-referenced spatial data. We combine and extend these methods to describe a Bayesian model for point-referenced (as opposed to lattice) spatially correlated ordered categorical data. We include simulation results and show that our model offers superior predictive performance as compared to a non-spatial cumulative probit model and a more standard Bayesian generalized linear spatial model. We demonstrate the usefulness of our model in a real-world example to predict ordered categories describing stream health within the state of Maryland. 相似文献

5.

Data augmentation strategies for the Bayesian spatial probit regression model

Candace Berrett Catherine A. Calder 《Computational statistics & data analysis》2012,56(3):478-490

The well known latent variable representation of the Bayesian probit regression model due to Albert and Chib (1993) allows model fitting to be performed using a simple Gibbs sampler. In addition, various types of dependence among categorical outcomes not explained by covariate information can be accommodated in a straightforward manner as a result of this latent variable representation of the model. One example of this is the spatial probit regression model for spatially-referenced categorical outcomes. In this setting, commonly used covariance structures for describing residual spatial dependence in the normal linear model setting can be imbedded into the probit regression model. Capturing spatial dependence in this way, however, can negatively impact the performance of MCMC model-fitting algorithms, particularly in terms of mixing and sensitivity to starting values. To address these computational issues, we demonstrate how the non-identifiable spatial variance parameter can be used to create data augmentation MCMC algorithms. We compare the performance of several non-collapsed and partially collapsed data augmentation MCMC algorithms through a simulation study and an analysis of land cover data. 相似文献

6.

Weighted kappa statistic for clustered matched-pair ordinal data

《Computational statistics & data analysis》2015

As an important extension of the regular kappa statistic, the weighted kappa statistic has been widely used to assess the agreement between two procedures for independent matched-pair ordinal data. For clustered matched-pair ordinal data, based on the delta method and sampling techniques, a non-parametric variance estimator for the weighted kappa statistic is proposed without within-cluster correlation structure or distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed weighted kappa statistic provides consistent estimation, and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g.,

K \geq 50

). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (

ρ \geq 0.3

), with more pronounced improvement when

ρ

is further increased. Moreover, under the general analysis of variance setting with systematic variability between procedures and clusters being included as a component of total variation, the equivalence between weighted kappa statistic and intraclass correlation coefficient is established. To illustrate the practical application of the proposed estimator, two real medical research data examples of clustered matched-pair ordinal data are analyzed, including an agreement study to compare two methods for assessing cervical ectopy, and a physician–patients data example from the Enhancing Communication and HIV Outcomes study. 相似文献

7.

A composite likelihood approach for spatially correlated survival data

Jane Paik Zhiliang Ying 《Computational statistics & data analysis》2012,56(1):209-216

The aim of this paper is to provide a composite likelihood approach to handle spatially correlated survival data using pairwise joint distributions. With e-commerce data, a recent question of interest in marketing research has been to describe spatially clustered purchasing behavior and to assess whether geographic distance is the appropriate metric to describe purchasing dependence. We present a model for the dependence structure of time-to-event data subject to spatial dependence to characterize purchasing behavior from the motivating example from e-commerce data. We assume the Farlie-Gumbel-Morgenstern (FGM) distribution and then model the dependence parameter as a function of geographic and demographic pairwise distances. For estimation of the dependence parameters, we present pairwise composite likelihood equations. We prove that the resulting estimators exhibit key properties of consistency and asymptotic normality under certain regularity conditions in the increasing-domain framework of spatial asymptotic theory. 相似文献

8.

Generating correlated discrete ordinal data using R and SAS IML

Noor Akma Ibrahim Suliadi Suliadi 《Computer methods and programs in biomedicine》2011,104(3):e122

相似文献

9.

Cost efficiency analysis with ordinal data: a theoretical and computational view

《国际计算机数学杂志》2012,89(4):553-562

Standard Data Envelopment Analysis models obtain the cost efficiency of units when the data are known exactly, but these models fail to evaluate the units in the presence of ordinal data. Therefore, this paper provides models for the treatment of ordinal data in cost efficiency analysis. The models have multiplier forms with additional weight restrictions. The main idea in constructing these models is based on the weighted enumeration of the number of inputs/outputs of each unit which are categorized on the same scale rate. Some techniques to reduce the complexity of the models are introduced. 相似文献

10.

基于空间数据仓库的数据采掘

杜明义吉寿松郭达志《计算机工程与应用》2000,(11)

文章介绍了数据采掘技术的定义、数据采掘的过程和主要技术手段以及空间数据仓库的定义、基本结构框架、处理流程和技术支持,分析了基于空间数据仓库的数据采掘特点. 相似文献

11.

Wcrl: A data model independent language for database systems

Sudhir K. Arora K. C. Smith 《International journal of parallel programming》1980,9(4):287-305

The need for data model independent languages for database systems has become apparent in recent years. They can be used for the conceptual level of a database system, for communication in a distributed database system, for data restructuring, and so on. This paper proposes a language,wcrl, to fill this need and compares it with the very few other languages which have been developed almost concurrently to fill the same need. 相似文献

12.

Dropout training for SVMs with data augmentation

Ning CHEN Jun ZHU Jianfei CHEN Ting CHEN 《Frontiers of Computer Science》2018,12(4):694-713

Dropout and other feature noising schemes have shown promise in controlling over-fitting by artificially corrupting the training data. Though extensive studies have been performed for generalized linear models, little has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for both linear SVMs and the nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least square (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a reweighted least square problem, where the re-weights are analytically updated. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation technique in conjunction with first-order Taylor-expansion to deal with the intractable expected hinge loss and the nonlinearity of latent representations. Finally, we apply the similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a non-linear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights on the connection and difference between the hinge loss and logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training on significantly boosting the classification accuracy of both linear and nonlinear SVMs. 相似文献

13.

Mutation-oriented test data augmentation for GUI software fault localization

《Information and Software Technology》2013,55(12):2076-2098

ContextFault localization lies at the heart of program debugging and often proceeds by contrasting the statistics of program constructs executed by passing and failing test cases. A vital issue here is how to obtain these “suitable” test cases. Techniques presented in the literature mostly assume the existence of a large test suite a priori. However, developers often encounter situations where a failure occurs, but where no or no appropriate test suite is available for use to localize the fault.ObjectiveThis paper aims to alleviate this key limitation of traditional fault localization techniques for GUI software particularly, namely, it aims at enabling cost-effective fault localization process for GUI software in the described scenario.MethodTo address this scenario, we propose a mutation-oriented test data augmentation technique, which actually is directed by the “similarity” criterion in GUI software’s test case context towards the generation of test suite with excellent fault localization capabilities. More specifically, the technique mainly uses four proposed novel mutation operators to iteratively mutate some failing GUI test cases’ event sequences to derive new test cases potentially useful to localize the specific encountered fault. We then compare the fault localization performance of the test suite generated using this technique with that of an original provided large event-pair adequate test suite on some GUI applications.ResultsThe results indicate that the proposed technique is capable of generating a test suite that has comparable, if not better, fault localization effectiveness to the event-pair adequate test suite, but it is much smaller and it is generated immediately once a failure is encountered by developers.ConclusionIt is concluded that the proposed technique can truly enable quick-start cost-effective fault localization process under the investigated all-too-common scenario, greatly alleviating one key limitation of traditional fault localization techniques and prompting the test–diagnose–repair cycle. 相似文献

14.

Towards data augmentation in graph neural network: An overview and evaluation

《Computer Science Review》2023

Many studies on Graph Data Augmentation (GDA) approaches have emerged. The techniques have rapidly improved performance for various graph neural network (GNN) models, increasing the current state-of-the-art accuracy by absolute values of 4.20%, 5.50%, and 4.40% on Cora, Citeseer, and PubMed, respectively. The success is attributed to two integral properties of relational approaches: topology-level and feature-level augmentation. This work provides an overview of some GDA algorithms which are reasonably categorized based on these integral properties. Next, we engage the three most widely used GNN backbones (GCN, GAT, and GraphSAGE) as plug-and-play methods for conducting experiments. We conclude by evaluating the algorithm’s effectiveness to demonstrate significant differences among various GDA techniques based on accuracy and time complexity with additional datasets different from those used in the original works. While discussing practical and theoretical motivations, considerations, and strategies for GDA, this work comprehensively investigates the challenges and future direction by pinpointing several open conceivable issues that may require further study based on far-reaching literature interpretation and empirical outcomes. 相似文献

15.

Establish a patent risk prediction model for emerging technologies using deep learning and data augmentation

《Advanced Engineering Informatics》2022

Technology patents are considered the source and bedrock of emerging technologies. Patents create value in any enterprise. However, obtaining patents is time consuming, expensive, and risky; especially if the patent application is rejected. The development of new patents requires extensive costs and resources, but sometimes they may be similar to other patents once the technology is fully developed. They might lack relevant patentable features and as a result, fail to pass the patent examination, resulting in investment losses. Patent infringement is also an especially important topic for reducing the risk of legal damages of patent holders, applicants, and manufacturers. Patent examinations have so far been performed manually. Due to manpower and time limitations, the examination time is exceedingly long and inefficient. Current patent similarity comparison research, and the classification algorithms of text mining are most commonly employed to provide analyses of the possibility of examination approval, but there is insufficient discussion about the possibility of infringement. However, if a new technology or innovation can be accurately determined in advance whether it likely to pass or fail (and why), or is at risk of patent infringement, losses can be mitigated.This research attempts to identify the issues involved in evaluating patent applications and infringement risks from existing patent databases. For each patent application, this research uses Convolutional Neural Networks, CNN + Long Short Term Memory Network, LSTM, prediction model, and the United States Patent and Trademark Office (USPTO) public utility patent application and reviews results based on keyword search. Then, data augmentation is utilized before performing model training; 10% of the approved and rejected applications are randomly selected as test cases, with the remaining 90% of the cases used to train the prediction model of this research in order to determine a model that can predict patent infringement and examination outcomes. Experimental results of the model in this study predicts that the accuracy of each classification is at least 87.7%, and can be used to find the classification of the reason for a rejection of a patent application failure. 相似文献

16.

Goodness-of-fit tests for modeling longitudinal ordinal data

Kuo-Chin Lin 《Computational statistics & data analysis》2010,54(7):1872-1880

Longitudinal studies involving categorical responses are extensively applied in many fields of research and are often fitted by the generalized estimating equations (GEE) approach and generalized linear mixed models (GLMMs). The assessment of model fit is an important issue for model inference. The purpose of this article is to extend Pan’s (2002a) goodness-of-fit tests for GEE models with longitudinal binary data to the tests for logistic proportional odds models with longitudinal ordinal data. Two proposed methods based on Pearson chi-squared test and unweighted sum of residual squares are developed, and the approximate expectations and variances of the test statistics are easily computed. Four major variants of working correlation structures, independent, AR(1), exchangeable and unspecified, are considered to estimate the variances of the proposed test statistics. Simulation studies in terms of type I error rate and the power performance of the proposed tests are presented for various sample sizes. Furthermore, the approaches are demonstrated by two real data sets. 相似文献

17.

空间数据挖掘的研究与发展 总被引：7，自引：0，他引：7

陈中祥岳超源《计算机工程与应用》2003,39(3):5-7,33

随着空间数据获取手段的快速发展,从大量空间数据中自动、快速、有效地发现知识显得越来越重要。该文简要介绍了空间数据挖掘技术的产生背景和发展现状,对空间数据挖掘的基本理论和主要研究领域进行了概述,总结了空间数据挖掘近年来的研究成果,并对下一步发展做了展望。相似文献

18.

基于数据增强和相似伪标签的半监督文本分类算法

沈海龙盛晓辉《计算机应用研究》2023,40(4):1019-1023+1051

为了减少对有标记数据的依赖,充分利用大量无标记数据,提出了一个基于数据增强和相似伪标签的半监督文本分类算法(semi-supervised text classification algorithm with data augmentation and similar pseudo-labels, STAP)。该算法利用EPiDA(easy plug-in data augmentation)框架和自训练对少量有标记数据进行扩充,采用一致性训练和相似伪标签考虑无标记数据及其增强样本之间的关系和高置信度的相似无标记数据之间的关系,在有监督交叉熵损失、无监督一致性损失和无监督配对损失的约束下,提高无标记数据的质量。在四个文本分类数据集上进行实验,与其他经典的文本分类算法相比,STAP算法有明显的改进效果。相似文献

19.

Times-series data augmentation and deep learning for construction equipment activity recognition

《Advanced Engineering Informatics》2019

Automated, real-time, and reliable equipment activity recognition on construction sites can help to minimize idle time, improve operational efficiency, and reduce emissions. Previous efforts in activity recognition of construction equipment have explored different classification algorithms anm accelerometers and gyroscopes. These studies utilized pattern recognition approaches such as statistical models (e.g., hidden-Markov models); shallow neural networks (e.g., Artificial Neural Networks); and distance algorithms (e.g., K-nearest neighbor) to classify the time-series data collected from sensors mounted on the equipment. Such methods necessitate the segmentation of continuous operational data with fixed or dynamic windows to extract statistical features. This heuristic and manual feature extraction process is limited by human knowledge and can only extract human-specified shallow features. However, recent developments in deep neural networks, specifically recurrent neural network (RNN), presents new opportunities to classify sequential time-series data with recurrent lateral connections. RNN can automatically learn high-level representative features through the network instead of being manually designed, making it more suitable for complex activity recognition. However, the application of RNN requires a large training dataset which poses a practical challenge to obtain from real construction sites. Thus, this study presents a data-augmentation framework for generating synthetic time-series training data for an RNN-based deep learning network to accurately and reliably recognize equipment activities. The proposed methodology is validated by generating synthetic data from sample datasets, that were collected from two earthmoving operations in the real world. The synthetic data along with the collected data were used to train a long short-term memory (LSTM)-based RNN. The trained model was evaluated by comparing its performance with traditionally used classification algorithms for construction equipment activity recognition. The deep learning framework presented in this study outperformed the traditionally used machine learning classification algorithms for activity recognition regarding model accuracy and generalization. 相似文献

20.

多任务实时声音事件检测卷积模型与复合数据扩增

刘臣倪仁倢周立欣《计算机应用研究》2023,40(4):1080-1087

现有的声音事件检测研究多为对离线音频进行分析,且模型参数量较多、计算效率低,不适用于实时检测。提出一种面向多任务实时声音事件检测的轻量化卷积神经网络模型,它将唤醒与检测任务整合成多任务学习框架,此外模型的卷积结构联合了稠密连接、Ghost模组与SE注意力机制;另外还提出了一种复合数据扩增方法,将音频变换、随机裁剪与频谱掩蔽相结合。实验结果显示,该模型在ESC-10和Urbansound8K数据集上的平均预测准确率高于当前新型的基线模型2%以上,同时模型的参数和内存更少。研究表明,多任务学习的方式节省了计算量,又因为卷积结构复用了中间层特征,模型可以快速地反馈检测结果。另外,复合数据方法相比传统方法使模型获得了更好的性能和鲁棒性。相似文献