期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Automated bug assignment: Ensemble-based machine learning in large scale industrial contexts

Leif?Jonsson Email author View author&#;s OrcID profile Markus?Borg David?Broman Kristian?Sandahl Sigrid?Eldh Per?Runeson 《Empirical Software Engineering》2016,21(4):1533-1578

Bug report assignment is an important part of software maintenance. In particular, incorrect assignments of bug reports to development teams can be very expensive in large software development projects. Several studies propose automating bug assignment techniques using machine learning in open source software contexts, but no study exists for large-scale proprietary projects in industry. The goal of this study is to evaluate automated bug assignment techniques that are based on machine learning classification. In particular, we study the state-of-the-art ensemble learner Stacked Generalization (SG) that combines several classifiers. We collect more than 50,000 bug reports from five development projects from two companies in different domains. We implement automated bug assignment and evaluate the performance in a set of controlled experiments. We show that SG scales to large scale industrial application and that it outperforms the use of individual classifiers for bug assignment, reaching prediction accuracies from 50 % to 89 % when large training sets are used. In addition, we show how old training data can decrease the prediction accuracy of bug assignment. We advice industry to use SG for bug assignment in proprietary contexts, using at least 2,000 bug reports for training. Finally, we highlight the importance of not solely relying on results from cross-validation when evaluating automated bug assignment. 相似文献

2.

基于缺陷相似度与再分配图的软件缺陷分配方法

史高翔赵逢禹《计算机科学》2016,43(11):246-251

准确地将缺陷分配给最合适的修复者对大型软件项目的缺陷修复具有重要意义。当前缺陷自动分配技术的研究主要利用历史缺陷报告的描述信息、缺陷关联信息、历史分派信息等,但这些方法都没有将缺陷报告信息充分挖掘。提出在缺陷报告分配时将缺陷历史分派信息和缺陷文本相似信息相结合。首先根据缺陷历史分派信息生成再分配图;然后计算新缺陷报告与历史缺陷报告缺陷的文本相似度,找出相似度最高的前K个缺陷报告所对应的修复者;最后,根据这些修复者在再分配图中的依赖关系生成预测再分配路径。为了验证该方法的有效性,利用Eclipse和Mozilla的缺陷报告集进行实验,实验表明提出的方法在预测的准确度上明显优于其他方法。相似文献

3.

Bug Triaging Based on Tossing Sequence Modeling

下载免费PDF全文

Sheng-Qu Xi Yuan Yao Member CCF Xu-Sheng Xiao Member ACM IEEE Feng Xu Member CCF Jian Lv Fellow CCF 《计算机科学技术学报》2019,34(5):942-956

Bug triaging, which routes the bug reports to potential fixers, is an integral step in software development and maintenance. To make bug triaging more efficient, many researchers propose to adopt machine learning and information retrieval techniques to identify some suitable fixers for a given bug report. However, none of the existing proposals simultaneously take into account the following three aspects that matter for the efficiency of bug triaging:1) the textual content in the bug reports, 2) the metadata in the bug reports, and 3) the tossing sequence of the bug reports. To simultaneously make use of the above three aspects, we propose iTriage which first adopts a sequence-to-sequence model to jointly learn the features of textual content and tossing sequence, and then uses a classification model to integrate the features from textual content, metadata, and tossing sequence. Evaluation results on three different open-source projects show that the proposed approach has significantly improved the accuracy of bug triaging compared with the state-of-the-art approaches. 相似文献

4.

An empirical study of factors affecting cross-project aging-related bug prediction with TLAP

Qin Fangyun Wan Xiaohui Yin Beibei 《Software Quality Journal》2020,28(1):107-134

Software aging is a phenomenon in which long-running software systems show an increasing failure rate and/or progressive performance degradation. Due to their nature, Aging-Related Bugs (ARBs) are hard to discover during software testing and are also challenging to reproduce. Therefore, automatically predicting ARBs before software release can help developers reduce ARB impact or avoid ARBs. Many bug prediction approaches have been proposed, and most of them show effectiveness in within-project prediction settings. However, due to the low presence and reproducing difficulty of ARBs, it is usually hard to collect sufficient training data to build an accurate prediction model. A recent work proposed a method named Transfer Learning based Aging-related bug Prediction (TLAP) for performing cross-project ARB prediction. Although this method considerably improves cross-project ARB prediction performance, it has been observed that its prediction result is affected by several key factors, such as the normalization methods, kernel functions, and machine learning classifiers. Therefore, this paper presents the first empirical study to examine the impact of these factors on the effectiveness of cross-project ARB prediction in terms of single-factor pattern, bigram pattern, and triplet pattern and validates the results with the Scott-Knott test technique. We find that kernel functions and classifiers are key factors affecting the effectiveness of cross-project ARB prediction, while normalization methods do not show statistical influence. In addition, the order of values in three single-factor patterns is maintained in three bigram patterns and one triplet pattern to a large extent. Similarly, the order of values in the three bigram patterns is also maintained in the triplet pattern.

相似文献

5.

基于深度自编码网络的软件缺陷预测方法

周末徐玲杨梦宁廖胜平鄢萌《计算机工程与科学》2018,40(10):1796-1804

软件缺陷预测是提升软件质量的有效方法,而软件缺陷预测方法的预测效果与数据集自身的特点有着密切的相关性。针对软件缺陷预测中数据集特征信息冗余、维度过大的问题,结合深度学习对数据特征强大的学习能力,提出了一种基于深度自编码网络的软件缺陷预测方法。该方法首先使用一种基于无监督学习的采样方法对6个开源项目数据集进行采样,解决了数据集中类不平衡问题;然后训练出一个深度自编码网络模型。该模型能对数据集进行特征降维,模型的最后使用了三种分类器进行连接,该模型使用降维后的训练集训练分类器,最后用测试集进行预测。实验结果表明,该方法在维数较大、特征信息冗余的数据集上的预测性能要优于基准的软件缺陷预测模型和基于现有的特征提取方法的软件缺陷预测模型,并且适用于不同分类算法。相似文献

6.

Multiple kernel ensemble learning for software defect prediction

Tiejian Wang Zhiwu Zhang Xiaoyuan Jing Liqiang Zhang 《Automated Software Engineering》2016,23(4):569-590

Software defect prediction aims to predict the defect proneness of new software modules with the historical defect data so as to improve the quality of a software system. Software historical defect data has a complicated structure and a marked characteristic of class-imbalance; how to fully analyze and utilize the existing historical defect data and build more precise and effective classifiers has attracted considerable researchers’ interest from both academia and industry. Multiple kernel learning and ensemble learning are effective techniques in the field of machine learning. Multiple kernel learning can map the historical defect data to a higher-dimensional feature space and make them express better, and ensemble learning can use a series of weak classifiers to reduce the bias generated by the majority class and obtain better predictive performance. In this paper, we propose to use the multiple kernel learning to predict software defect. By using the characteristics of the metrics mined from the open source software, we get a multiple kernel classifier through ensemble learning method, which has the advantages of both multiple kernel learning and ensemble learning. We thus propose a multiple kernel ensemble learning (MKEL) approach for software defect classification and prediction. Considering the cost of risk in software defect prediction, we design a new sample weight vector updating strategy to reduce the cost of risk caused by misclassifying defective modules as non-defective ones. We employ the widely used NASA MDP datasets as test data to evaluate the performance of all compared methods; experimental results show that MKEL outperforms several representative state-of-the-art defect prediction methods. 相似文献

7.

基于半监督学习和支持向量机的煤与瓦斯突出预测研究 总被引：1，自引：1，他引：0

孙云霄方健马小平《工矿自动化》2012,38(11):40-42

针对支持向量机要求输入向量为已标记样本,而实际应用中已标记样本很难获取的问题,提出将半监督学习和支持向量机结合的煤与瓦斯突出预测方法;介绍了采用SVM预测煤与瓦斯突出的流程及其输入向量的选择;对半监督学习中的协同训练算法进行了改进:在同一属性集上训练2个不同分类器SVM和KNN,将2个分类器标记一致的样本加入训练集,从而充分利用未标记样本不断补充信息,更新训练集标记样本,达到强化训练集的目的。测试结果表明,改进后的算法比单独的支持向量机预测方法准确率更高。相似文献

8.

融合文本分布式表示的重复缺陷报告检测

曾杰贲可荣张献徐永士《计算机工程与科学》2021,43(4):670-680

重复缺陷报告检测能够避免对描述同一缺陷的多份报告进行重复的任务分派和修复,可降低软件维护成本。为了进一步提高检测的准确率,提出一种融合文本分布式表示的重复缺陷报告检测方法。首先,基于大规模缺陷报告数据库训练Doc2Vec模型并抽取缺陷报告的分布式表示,将不同长度的缺陷报告编码为统一长度的稠密向量。接着,通过比较这些向量来计算不同缺陷报告的相似程度,将其作为一种新特征与重复缺陷报告检测过程常用的其它特征进行融合,并利用机器学习算法训练二元分类模型。在公开的Bugzilla重复缺陷报告数据集上的实验结果表明,相比于代表性方法D_TS,本文方法的F1值平均提升了2%,说明了新特征的有效性。相似文献

9.

A semi-automatic approach for workflow staff assignment

Yingbo Jianmin Yun Jiaguang 《Computers in Industry》2008,59(5):463-476

Staff assignment is of great importance for workflow management systems. In many workflow applications, staff assignment is still performed manually. In this paper, we present a semi-automatic approach intended to reduce the number of manual staff assignment. Our approach applies a machine learning algorithm to the workflow event log to learn various kinds of activities that each actor undertakes. When staff assignment is needed, the classifiers generated by the machine learning technique suggest a suitable actor to undertake the specified activities. With experiments on three enterprises, our approach achieved a fairly accurate recommendation. 相似文献

10.

基于深度学习的安全缺陷报告预测方法实证研究

郑炜陈军正吴潇雪陈翔夏鑫《软件学报》2020,31(5):1294-1313

软件安全问题的发生在大多数情况下会造成非常严重的后果，及早发现安全问题，是预防安全事故的关键手段之一.安全缺陷报告预测可以辅助开发人员及早发现被测软件中潜藏的安全缺陷，从而尽早得以修复.然而，由于安全缺陷在实际项目中的数量较少，而且特征复杂（即安全缺陷类型繁多，不同类型安全缺陷特征差异性较大），这使得手工提取特征相对困难，并随后造成传统机器学习分类算法在安全缺陷报告预测性能方面存在一定的瓶颈.针对该问题，提出基于深度学习的安全缺陷报告预测方法，采用深度文本挖掘模型TextCNN和TextRNN构建安全缺陷报告预测模型；针对安全缺陷报告文本特征，使用skip-grams方式构建词嵌入矩阵，并借助注意力机制对TextRNN模型进行优化.所构建的模型在5个不同规模的安全缺陷报告数据集上展开了大规模实证研究，实证结果表明：深度学习模型在80%的实验案例中都要优于传统机器学习分类算法，性能指标F1-score平均可提升0.258，在最好的情况下甚至可以提升0.535.除此之外，针对安全缺陷报告数据集存在的类不均衡问题，对不同采样方法进行了实证研究，并对结果进行了分析. 相似文献

11.

机器学习在乳腺肿瘤分类检测中的应用研究

李喆吕卫闵行褚晶辉《计算机工程与科学》2016,38(11):2303-2309

机器学习算法在医学检测与诊断,尤其是乳腺肿瘤分类检测与诊断中扮演愈发重要的角色。分析比较了几种经典机器学习分类器在乳腺肿瘤分类检测中的性能,并从准确率、灵敏度、特异性及执行效率等方面对各分类器的性能进行了评估比较,根据在不同数据库上的实验结果,总结了各机器学习分类器在乳腺肿瘤分类中的性能特点:线性判别分析和极限学习机两种分类器性能优良且训练效率很高;支持向量机性能较为平均且非常稳定,但训练耗时较长;而人工神经网络分类器虽然可以给出良好的特异性指标,但灵敏度指标不够理想。相似文献

12.

A novel ensemble learning approach to unsupervised record linkage

《Information Systems》2017

Record linkage is a process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset have been applied to record linkage. These techniques reduce the requirement on the manual labelling of the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach an ensemble of automatic self-learning models is generated with different similarity measure schemes. In order to further improve the automatic self-learning process we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on 4 publicly available datasets which are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. In 3 out of 4 datasets it also achieves comparable results to those of the supervised approaches. 相似文献

13.

Using neural network ensembles for bankruptcy prediction and credit scoring 总被引：2，自引：0，他引：2

Chih-Fong Tsai Jhen-Wei Wu 《Expert systems with applications》2008,34(4):2639-2649

Bankruptcy prediction and credit scoring have long been regarded as critical topics and have been studied extensively in the accounting and finance literature. Artificial intelligence and machine learning techniques have been used to solve these financial decision-making problems. The multilayer perceptron (MLP) network trained by the back-propagation learning algorithm is the mostly used technique for financial decision-making problems. In addition, it is usually superior to other traditional statistical models. Recent studies suggest combining multiple classifiers (or classifier ensembles) should be better than single classifiers. However, the performance of multiple classifiers in bankruptcy prediction and credit scoring is not fully understood. In this paper, we investigate the performance of a single classifier as the baseline classifier to compare with multiple classifiers and diversified multiple classifiers by using neural networks based on three datasets. By comparing with the single classifier as the benchmark in terms of average prediction accuracy, the multiple classifiers only perform better in one of the three datasets. The diversified multiple classifiers trained by not only different classifier parameters but also different sets of training data perform worse in all datasets. However, for the Type I and Type II errors, there is no exact winner. We suggest that it is better to consider these three classifier architectures to make the optimal financial decision. 相似文献

14.

A systematic review of machine learning techniques for software fault prediction

《Applied Soft Computing》2015

BackgroundSoftware fault prediction is the process of developing models that can be used by the software practitioners in the early phases of software development life cycle for detecting faulty constructs such as modules or classes. There are various machine learning techniques used in the past for predicting faults.MethodIn this study we perform a systematic review of studies from January 1991 to October 2013 in the literature that use the machine learning techniques for software fault prediction. We assess the performance capability of the machine learning techniques in existing research for software fault prediction. We also compare the performance of the machine learning techniques with the statistical techniques and other machine learning techniques. Further the strengths and weaknesses of machine learning techniques are summarized.ResultsIn this paper we have identified 64 primary studies and seven categories of the machine learning techniques. The results prove the prediction capability of the machine learning techniques for classifying module/class as fault prone or not fault prone. The models using the machine learning techniques for estimating software fault proneness outperform the traditional statistical models.ConclusionBased on the results obtained from the systematic review, we conclude that the machine learning techniques have the ability for predicting software fault proneness and can be used by software practitioners and researchers. However, the application of the machine learning techniques in software fault prediction is still limited and more number of studies should be carried out in order to obtain well formed and generalizable results. We provide future guidelines to practitioners and researchers based on the results obtained in this work. 相似文献

15.

Efficiently processing deterministic approximate aggregation query on massive data

Xixian Han Bailing Wang Jianzhong Li Hong Gao 《Knowledge and Information Systems》2018,54(2):437-462

Financial distress prediction is very important to financial institutions who must be able to make critical decisions regarding customer loans. Bankruptcy prediction and credit scoring are the two main aspects considered in financial distress prediction. To assist in this determination, thereby lowering the risk borne by the financial institution, it is necessary to develop effective prediction models for prediction of the likelihood of bankruptcy and estimation of credit risk. A number of financial distress prediction models have been constructed, which utilize various machine learning techniques, such as single classifiers and classifier ensembles, but improving the prediction accuracy is the major research issue. In addition, aside from improving the prediction accuracy, there have been very few studies that specifically consider lowering the Type I error. In practice, Type I errors need to receive careful consideration during model construction because they can affect the cost to the financial institution. In this study, we introduce a classifier ensemble approach designed to reduce the misclassification cost. The outputs produced by multiple classifiers are combined by utilizing the unanimous voting (UV) method to find the final prediction result. Experimental results obtained based on four relevant datasets show that our UV ensemble approach outperforms the baseline single classifiers and classifier ensembles. Specifically, the UV ensemble not only provides relatively good prediction accuracy and minimizes Type I/II errors, but also produces the smallest misclassification cost. 相似文献

16.

Feature selection in the Laplacian support vector machine

Sangjun Lee Ja-Yong Koo 《Computational statistics & data analysis》2011,55(1):567-577

Traditional classifiers including support vector machines use only labeled data in training. However, labeled instances are often difficult, costly, or time consuming to obtain while unlabeled instances are relatively easy to collect. The goal of semi-supervised learning is to improve the classification accuracy by using unlabeled data together with a few labeled data in training classifiers. Recently, the Laplacian support vector machine has been proposed as an extension of the support vector machine to semi-supervised learning. The Laplacian support vector machine has drawbacks in its interpretability as the support vector machine has. Also it performs poorly when there are many non-informative features in the training data because the final classifier is expressed as a linear combination of informative as well as non-informative features. We introduce a variant of the Laplacian support vector machine that is capable of feature selection based on functional analysis of variance decomposition. Through synthetic and benchmark data analysis, we illustrate that our method can be a useful tool in semi-supervised learning. 相似文献

17.

Using random forest and decision tree models for a new vehicle prediction approach in computational toxicology

Pritesh Mistry Daniel Neagu Paul R. Trundle Jonathan D. Vessey 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2016,20(8):2967-2979

Drug vehicles are chemical carriers that provide beneficial aid to the drugs they bear. Taking advantage of their favourable properties can potentially allow the safer use of drugs that are considered highly toxic. A means for vehicle selection without experimental trial would therefore be of benefit in saving time and money for the industry. Although machine learning is increasingly used in predictive toxicology, to our knowledge there is no reported work in using machine learning techniques to model drug-vehicle relationships for vehicle selection to minimise toxicity. In this paper we demonstrate the use of data mining and machine learning techniques to process, extract and build models based on classifiers (decision trees and random forests) that allow us to predict which vehicle would be most suited to reduce a drug’s toxicity. Using data acquired from the National Institute of Health’s (NIH) Developmental Therapeutics Program (DTP) we propose a methodology using an area under a curve (AUC) approach that allows us to distinguish which vehicle provides the best toxicity profile for a drug and build classification models based on this knowledge. Our results show that we can achieve prediction accuracies of 80 % using random forest models whilst the decision tree models produce accuracies in the 70 % region. We consider our methodology widely applicable within the scientific domain and beyond for comprehensively building classification models for the comparison of functional relationships between two variables. 相似文献

18.

Enhancing touch behavioral authentication via cost-based intelligent mechanism on smartphones

Weizhi Meng Wenjuan Li Duncan S. Wong 《Multimedia Tools and Applications》2018,77(23):30167-30185

Due to the popularity of smartphones, there is a great need to deploy appropriate authentication mechanisms to safeguard users’ sensitive data. Touch dynamics-based authentication has been developed to verify smartphone users and detect imposters. These schemes usually employ machine learning techniques to detect behavioral anomalies by comparing current behavioral actions with the stored normal model. However, we notice that machine learning classifiers often have an unstable performance, which would greatly reduce the system usability, i.e., causing a high false rejection. In this work, we are motivated by this challenge and design a cost-based intelligent mechanism that can choose a less costly algorithm for user authentication. In the evaluation, we conduct a user study with a total of 60 users to investigate the performance of our mechanism with a lightweight touch gesture-based scheme on smartphones. Experimental results demonstrate that our approach can help achieve a relatively higher and more stable authentication accuracy, as compared to the use of a sole classifier. 相似文献

19.

基于Box-Cox转换的集成跨项目软件缺陷预测方法

王莉萍陈翔王秋萍赵英全《计算机应用研究》2017,34(7)

软件缺陷预测通过预先识别出被测项目内的潜在缺陷程序模块,可以优化测试资源的分配并提高软件产品的质量。论文对跨项目缺陷预测问题展开了深入研究,在源项目实例选择时,考虑了三种不同的实例相似度计算方法,并发现这些方法的缺陷预测结果存在多样性,因此提出了一种基于Box-Cox转换的集成跨项目软件缺陷预测方法BCEL,具体来说,首先基于不同的实例相似度计算方法,从候选集中选出不同的训练集,随后针对这些数据集,进行针对性的Box-Cox转化,并借助特定分类方法构造出不同的基分类器,最后将这三个基分类器进行有效集成。基于实际项目的数据集,验证了BCEL方法的有效性,并深入分析了BCEL方法内的影响因素对缺陷预测性能的影响。相似文献

20.

Discretization for naive-Bayes learning: managing discretization bias and variance 总被引：4，自引：0，他引：4

Ying Yang Geoffrey I. Webb 《Machine Learning》2009,74(1):39-74

Quantitative attributes are usually discretized in Naive-Bayes learning. We establish simple conditions under which discretization is equivalent to use of the true probability density function during naive-Bayes learning. The use of different discretization techniques can be expected to affect the classification bias and variance of generated naive-Bayes classifiers, effects we name discretization bias and variance. We argue that by properly managing discretization bias and variance, we can effectively reduce naive-Bayes classification error. In particular, we supply insights into managing discretization bias and variance by adjusting the number of intervals and the number of training instances contained in each interval. We accordingly propose proportional discretization and fixed frequency discretization, two efficient unsupervised discretization methods that are able to effectively manage discretization bias and variance. We evaluate our new techniques against four key discretization methods for naive-Bayes classifiers. The experimental results support our theoretical analyses by showing that with statistically significant frequency, naive-Bayes classifiers trained on data discretized by our new methods are able to achieve lower classification error than those trained on data discretized by current established discretization methods. 相似文献