Similar Literature
20 similar documents found
1.
Noise detection for software measurement datasets is a topic of growing interest. The presence of class and attribute noise in software measurement datasets degrades the performance of machine learning-based classifiers, and identifying these noisy modules improves overall performance. In this study, we propose a noise detection algorithm based on software metrics threshold values, obtained from Receiver Operating Characteristic (ROC) analysis. This paper focuses on case studies of five public NASA datasets and details the construction of Naive Bayes-based software fault prediction models both before and after applying the proposed noise detection algorithm. Experimental results show that this noise detection approach is very effective for detecting class noise and that the performance of fault predictors using a Naive Bayes algorithm with a logNum filter improves when the class labels of identified noisy modules are corrected.
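A minimal sketch of the threshold idea, assuming Youden's J statistic is used to pick each metric's cut-off (the abstract does not specify the exact ROC criterion) and that a module whose metrics disagree with its label is flagged as class noise; the `min_violations` vote threshold is illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve

def metric_threshold(metric_values, labels):
    """Pick a cut-off for one software metric from its ROC curve.

    The threshold maximizing Youden's J = TPR - FPR is used here;
    the paper's exact ROC-based criterion may differ.
    """
    fpr, tpr, thresholds = roc_curve(labels, metric_values)
    return thresholds[np.argmax(tpr - fpr)]

def flag_noisy_modules(X, y, min_violations=2):
    """Flag modules whose class label disagrees with the metric thresholds.

    A module labeled fault-free that exceeds several thresholds (or the
    reverse) is treated as potential class noise.
    """
    thr = np.array([metric_threshold(X[:, j], y) for j in range(X.shape[1])])
    votes = (X >= thr).sum(axis=1)          # metrics voting "faulty"
    predicted = votes >= min_violations     # illustrative vote threshold
    return predicted != y.astype(bool)      # disagreement -> suspected noise
```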

2.
Objective: Vehicle detection and attribute recognition are basic tasks in intelligent transportation systems. Existing methods usually perform detection and recognition separately, which causes two problems: first, the detection algorithm and the recognition task are temporally coupled, increasing the complexity of algorithm design; second, multiple task modules and their interactions increase the computational load and reduce the efficiency of the system. To solve these problems, and exploiting the connection between a vehicle's visual attributes and its detection, we propose a joint vehicle detection and recognition method that integrates both tasks into a single algorithmic framework. Method: First, vehicle color and type are fused into the detection algorithm: a multi-task learning framework models the attribute recognition and localization tasks jointly, so attributes are recognized at detection time. Further, to address the unbalanced, long-tailed data distribution found in intelligent transportation scenarios, the multi-task framework is combined with online hard example mining (OHEM) to reduce the harm this phenomenon does to model optimization. Results: To validate the method, we built a road-vehicle image dataset of 12,712 images containing 19,398 vehicles. On this dataset, the joint detection and recognition algorithm achieves 85.6% detection accuracy, outperforming the SSD (single shot detector) and Faster-RCNN detection methods. For the recognition task, the method reaches 91.3% and 91.8% accuracy for color and type, respectively. Conclusion: Color and type are important visual features of vehicles; exploiting these cues together improves detection while also delivering good attribute recognition. In addition, completing multiple tasks within one highly integrated framework improves the operating efficiency of intelligent transportation systems.
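The multi-task idea can be sketched as one shared detection feature feeding a box-regression head plus color and type classification heads, trained with a summed loss and OHEM on the attribute losses. A minimal PyTorch sketch follows; the feature dimension, the 11-color/10-type head sizes, the uniform loss weights, and the OHEM keep-ratio are all illustrative assumptions, not the paper's configuration:

```python
import torch.nn as nn
import torch.nn.functional as F

class JointDetectHead(nn.Module):
    """Shared feature -> box regression + color + vehicle-type heads."""
    def __init__(self, feat_dim=256, n_colors=11, n_types=10):
        super().__init__()
        self.box = nn.Linear(feat_dim, 4)          # (cx, cy, w, h)
        self.color = nn.Linear(feat_dim, n_colors)
        self.vtype = nn.Linear(feat_dim, n_types)

    def forward(self, feats):
        return self.box(feats), self.color(feats), self.vtype(feats)

def ohem_ce(logits, target, keep_ratio=0.7):
    """Online hard example mining: back-propagate only the hardest samples."""
    losses = F.cross_entropy(logits, target, reduction="none")
    k = max(1, int(keep_ratio * losses.numel()))
    return losses.topk(k).values.mean()

def joint_loss(outputs, targets):
    """Localization plus attribute losses, uniform weighting assumed."""
    box, color, vtype = outputs
    box_t, color_t, type_t = targets
    return (F.smooth_l1_loss(box, box_t)
            + ohem_ce(color, color_t)
            + ohem_ce(vtype, type_t))
```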

3.
An empirical study of predicting software faults with case-based reasoning
The resources allocated for software quality assurance and improvement have not increased with the ever-increasing need for better software quality. A targeted software quality inspection can detect faulty modules and reduce the number of faults occurring during operations. We present a software fault prediction modeling approach with case-based reasoning (CBR), a part of the computational intelligence field focusing on automated reasoning processes. A CBR system functions as a software fault prediction model by quantifying, for a module under development, the expected number of faults based on similar modules that were previously developed. Such a system is composed of a similarity function, the number of nearest-neighbor cases used for fault prediction, and a solution algorithm. The selection of a particular similarity function and solution algorithm may affect the prediction accuracy of a CBR-based software fault prediction system. This paper presents an empirical study investigating the effects of using three different similarity functions and two different solution algorithms on the prediction accuracy of our CBR system. The influence of varying the number of nearest-neighbor cases on performance accuracy is also explored, and the benefits of using metric-selection procedures for our CBR system are evaluated. Case studies of a large legacy telecommunications system are used for our analysis. The CBR system using the Mahalanobis distance similarity function and the inverse distance weighted solution algorithm yielded the best fault prediction, and the CBR models performed better than models based on multiple linear regression.

Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, and statistical modeling. He has published more than 200 refereed papers in these areas and has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the Association for Computing Machinery, the IEEE Computer Society, and the IEEE Reliability Society. He served as general chair of the 1999 International Symposium on Software Reliability Engineering (ISSRE'99) and of the 2001 International Conference on Engineering of Computer Based Systems, and has served on the technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems.

Naeem Seliya received the M.S. degree in Computer Science from Florida Atlantic University, Boca Raton, FL, USA, in 2001, where he is currently a Ph.D. candidate in the Department of Computer Science and Engineering. His research interests include software engineering, computational intelligence, data mining, software measurement, software reliability and quality engineering, software architecture, computer data security, and network intrusion detection. He is a student member of the IEEE Computer Society and the Association for Computing Machinery.
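The best-performing configuration reported above (Mahalanobis distance with inverse-distance-weighted solutions) can be sketched as follows; this is an illustration of the technique, not the authors' implementation:

```python
import numpy as np

def mahalanobis_knn_predict(X_train, y_faults, x_new, k=5, eps=1e-6):
    """Estimate a module's fault count from its k most similar past modules.

    Similarity: Mahalanobis distance over the software metrics.
    Solution algorithm: inverse-distance weighting of neighbor fault counts.
    """
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))
    diff = X_train - x_new
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))
    nn_idx = np.argsort(d)[:k]
    w = 1.0 / (d[nn_idx] + eps)              # inverse-distance weights
    return float(np.dot(w, y_faults[nn_idx]) / w.sum())
```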

4.
Network security audit data has strong temporal characteristics. We propose SISC (Sequence mIning with Strict Constraints), a fast sequence-mining algorithm for audit data based on the SPADE algorithm. It fully exploits the temporal and attribute-related features of sequence data to guide mining, and uses strict attribute patterns to prune concept equivalence classes, improving the usefulness of the resulting rules. Experiments on real audit datasets show that SISC outperforms SPADE in efficiency, especially when the number of items greatly exceeds the number of attributes.
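SISC's equivalence-class pruning is not reproduced here, but the way a temporal constraint cuts the candidate space can be shown with a toy counter of ordered event pairs; `max_gap` and `min_support` are illustrative parameters:

```python
from collections import Counter

def frequent_pairs(sessions, max_gap=60, min_support=5):
    """Count ordered event pairs (a -> b) occurring within max_gap seconds.

    Each session is a list of (timestamp, event) tuples sorted by time;
    the time constraint lets the inner scan stop early, which is the
    pruning effect exploited by temporal sequence miners.
    """
    counts = Counter()
    for session in sessions:
        seen = set()  # count each pair once per session (sequence support)
        for i, (t1, a) in enumerate(session):
            for t2, b in session[i + 1:]:
                if t2 - t1 > max_gap:
                    break                 # temporal constraint prunes search
                if (a, b) not in seen:
                    seen.add((a, b))
                    counts[(a, b)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}
```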

5.
Software fault prediction is the process of developing models that help developers detect faulty classes or modules in the early phases of the development life cycle and determine which modules need more refactoring in the maintenance phase. Software reliability is the probability of failure-free operation during a period of time; when we describe a system as not reliable, it means that it contains many errors. Such errors may be acceptable in some systems, but they can lead to crucial problems in critical systems like aircraft, space shuttles, and medical systems. Therefore, locating faulty software modules is an essential step because it helps define the modules that need more refactoring or more testing. In this article, an approach is developed by integrating a genetic algorithm (GA) with a support vector machine (SVM) classifier and a particle swarm algorithm, as a step toward a better software fault prediction technique. The approach is applied to 24 datasets (12 NASA MDP and 12 Java open-source projects), where NASA MDP is considered a large-scale dataset and the Java open-source projects are considered small-scale datasets. Results indicate that integrating GA with SVM and the particle swarm algorithm improves the performance of the software fault prediction process on both large-scale and small-scale datasets and overcomes the limitations of previous studies.
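A common way to wire a GA around an SVM is to let the GA search for the metric subset that maximizes cross-validated accuracy. A toy sketch under that assumption follows; the particle swarm stage (e.g. for tuning SVM hyper-parameters) is omitted, and the population size, generation count, and mutation rate are illustrative:

```python
import random
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def ga_select_features(X, y, pop=20, gens=15, p_mut=0.05, seed=0):
    """Toy GA choosing a feature mask that maximizes SVM CV accuracy."""
    rng = random.Random(seed)
    n = X.shape[1]

    def fitness(mask):
        if not any(mask):
            return 0.0
        cols = [i for i, m in enumerate(mask) if m]
        return cross_val_score(SVC(), X[:, cols], y, cv=3).mean()

    population = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop // 2]                # truncation selection
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)               # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (rng.random() < p_mut) for g in child]  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```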

6.
Jehad Al Dallal, Software, 2013, 43(6): 685-704
Class cohesion metrics apply different approaches to quantify the relatedness of the attributes and methods in a class. These relations can be direct or transitive. Method invocations are among the key sources of potential transitive attribute-method relations: a method is related not only to the attributes it references directly, but potentially also, transitively, to the attributes referenced by the methods it invokes. Only a few of the existing class cohesion metrics capture this potential transitive cohesion aspect. In this paper, we classify method invocations as direct or transitive. The definitions of the class representative models used by 16 existing low-level design (LLD) metrics are extended to incorporate the cohesion caused by the two types of method invocations. The impact of incorporating transitive relations due to the two types of method invocations on cohesion values, and on the ability of the LLD metrics to predict faulty classes, is studied empirically. The results show that transitive relations due to both types of method invocations feature a considerable degree of cohesion that is not captured by most existing LLD metrics. In practice, however, incorporating transitive relations in cohesion measurement was found to be ineffective in improving the fault-prediction power of most LLD metrics.
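The transitive relation itself is simple to compute: expand each method's attribute references through its call graph to a fixed point, then evaluate any LLD cohesion metric over the expanded relation instead of the direct one. A minimal sketch with toy inputs:

```python
def transitive_attribute_refs(direct_refs, calls):
    """Expand each method's attribute references through the methods it
    invokes, directly or transitively.

    direct_refs: method -> set of attributes it references directly.
    calls:       method -> set of methods it invokes.
    """
    expanded = {m: set(refs) for m, refs in direct_refs.items()}
    changed = True
    while changed:                       # fixed point = transitive closure
        changed = False
        for m, callees in calls.items():
            for c in callees:
                new = expanded.get(c, set()) - expanded[m]
                if new:
                    expanded[m] |= new
                    changed = True
    return expanded

# Toy class: m2 never touches attribute 'a' directly but calls m1.
refs = {"m1": {"a"}, "m2": {"b"}}
calls = {"m2": {"m1"}}
print(transitive_attribute_refs(refs, calls))  # m2 now includes 'a' via m1
```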

7.
In this study, risk factors related to the environmental (climatological) conditions associated with motor vehicle accidents on the Konya-Afyonkarahisar highway were determined, with the aid of Geographical Information Systems (GIS), using a combination of K-means clustering (KMC)-based attribute weighting (KMCAW) and classifier algorithms, namely an artificial neural network (ANN) and an adaptive network-based fuzzy inference system (ANFIS). A dynamic segmentation process in ArcGIS 9.0, applied to the traffic accident reports recorded by the District Traffic Agency, identified the locations of the motor vehicle accidents. The attributes obtained from this system are the day, temperature, humidity, weather conditions, and month of the traffic accidents. The dataset comprises these five attributes and 358 observations: 179 without accident and 179 with accident. The proposed method comprises two stages. In the first stage, all attributes of the dataset are weighted using the KMCAW method; the aims of this weighting are both to increase the classification performance of the classifier algorithms and to transform the linearly non-separable traffic accident dataset into a linearly separable one. In the second stage, after weighting, the ANN and ANFIS classifiers are used separately to label each case as with accident or without accident. To evaluate the performance of the proposed method, classification accuracy, sensitivity, specificity, and the area under the ROC (Receiver Operating Characteristic) curve (AUC) were used. While the ANN and ANFIS classifiers alone obtained overall prediction accuracies of 53.93% and 38.76%, respectively, the combination of KMCAW with ANN and with ANFIS achieved overall prediction accuracies of 74.15% and 55.06% in predicting traffic accidents. The experimental results demonstrate that the proposed attribute weighting method, KMCAW, is a robust and effective data pre-processing method for predicting traffic accidents on the Konya-Afyonkarahisar highway in Turkey.
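One plausible reading of KMC-based attribute weighting, sketched below as an assumption rather than the published formula: cluster each attribute's values with k-means and scale the attribute by the ratio of its overall mean to the mean of its cluster centers, pulling same-cluster values together so the weighted data becomes easier to separate:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmc_attribute_weighting(X, n_clusters=2, seed=0):
    """Weight each attribute via k-means on its own values.

    weight_j = mean(attribute_j) / mean(cluster centers of attribute_j);
    the exact KMCAW weighting formula may differ in detail.
    """
    Xw = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j].reshape(-1, 1).astype(float)
        km = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit(col)
        weight = col.mean() / km.cluster_centers_.mean()
        Xw[:, j] = X[:, j] * weight
    return Xw
```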

8.
Wang Xiaopeng (王晓鹏), Computer Simulation (计算机仿真), 2020, 37(1): 234-238
Mining interval-valued attribute datasets can effectively reveal relationships between data. Existing mining methods do not cluster large-scale data first, so mining occupies much memory and yields low accuracy. To address this, we propose a new mining algorithm for interval-valued attribute datasets. The modules for problem definition, data preparation, data extraction, pattern prediction, and data clustering are analyzed in detail to complete the clustering of interval-valued attribute data. Based on the clustering result, the data are divided into several datasets; itemsets that meet the minimum support are selected as frequent itemsets, association rules between the datasets are extracted from them, and these rules are incorporated into the computation step to complete the mining. Simulation results show that, compared with traditional mining algorithms, the proposed algorithm occupies less memory and mines with higher accuracy.

9.
Automatically identifying, classifying, and grading the sensitive attributes (fields) of code-obfuscated structured datasets in production environments has become a bottleneck for structured-data privacy protection. We propose an algorithm for the automatic identification and grading of sensitive attributes in structured datasets. Attribute sensitivity is defined via information entropy; by clustering the sensitivities and mining association rules between attributes, the sensitive attributes of an arbitrary structured dataset are identified and their sensitivity quantified. By analyzing the mutual-information correlations and association rules among the attributes within each sensitive-attribute cluster, sensitive attributes are grouped and their average sensitivity quantified, achieving classification and grading. Experiments show that the algorithm identifies, classifies, and grades the sensitive attributes of arbitrary structured datasets with higher efficiency and precision. Comparative analysis shows that it performs identification and grading simultaneously, requires no prior knowledge of attribute features or a sensitive-feature dictionary, and accounts for both correlations and association relationships between attributes.
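The entropy-based sensitivity score at the core of the approach can be sketched directly; the clustering and association-rule stages are not reproduced, and the normalization to [0, 1] is an assumption for readability:

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (bits) of one attribute's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def sensitivity_scores(records, attributes):
    """Entropy-based sensitivity per attribute, normalized to [0, 1].

    High-entropy fields (e.g. IDs, phone numbers) score near 1; the
    paper additionally clusters these scores and mines association
    rules to group and grade the attributes.
    """
    raw = {a: attribute_entropy([r[a] for r in records]) for a in attributes}
    top = max(raw.values()) or 1.0
    return {a: h / top for a, h in raw.items()}
```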

10.
A decision-tree-based multi-attribute classification method
Lai Bangchuan, Chen Xiaohong (赖邦传, 陈晓红), Computer Engineering (计算机工程), 2005, 31(5): 88-89, 226
By analyzing the relationships among object attributes, building attribute lists, and reducing them to the effective attributes, while using grouped counting to collect the class-distribution statistics of attribute values, we propose a two-stage multi-attribute classification algorithm based on decision trees that effectively improves the accuracy of the discovered classification rules. The concrete algorithm is given at the end of the paper.
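The grouped-counting step is the statistic a decision-tree learner needs to score candidate splits: one pass over the rows per attribute, tallying how class labels distribute over each attribute value. A minimal sketch with toy data:

```python
from collections import defaultdict

def class_distribution_by_value(rows, attr_idx, label_idx=-1):
    """One-pass grouped counting of class labels per attribute value."""
    dist = defaultdict(lambda: defaultdict(int))
    for row in rows:
        dist[row[attr_idx]][row[label_idx]] += 1
    return {v: dict(c) for v, c in dist.items()}

rows = [("sunny", "no"), ("sunny", "no"), ("rain", "yes")]
print(class_distribution_by_value(rows, 0))
# {'sunny': {'no': 2}, 'rain': {'yes': 1}}
```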

11.
Software quality engineering comprises several quality assurance activities such as testing, formal verification, inspection, fault tolerance, and software fault prediction. Many researchers have developed and validated fault prediction models using machine learning and statistical techniques, employing different kinds of software metrics and diverse feature reduction techniques to improve model performance. However, these studies did not investigate the effects of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning, such as Random Forests, and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions address the effects of dataset size, metrics set, and feature selection techniques; to answer them, seven test groups were defined and nine classifiers were examined on each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets, and Naive Bayes is the best prediction algorithm for small datasets, in terms of the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition Systems algorithm (AIRS2Parallel) is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used.
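The style of comparison reported above (AUC of Random Forests vs. Naive Bayes on one dataset) is easy to reproduce with scikit-learn; this sketch assumes 5-fold cross-validation, which may differ from the study's exact protocol:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def compare_auc(X, y):
    """Cross-validated AUC for Random Forests vs. Naive Bayes."""
    models = [("RandomForest", RandomForestClassifier(n_estimators=100)),
              ("NaiveBayes", GaussianNB())]
    for name, clf in models:
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        print(f"{name}: AUC = {auc:.3f}")
```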

12.
To address the slow convergence and low prediction accuracy of Elman neural networks when predicting closing prices from stock-market online public opinion, we propose a prediction model combining Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), an Improved Whale Optimization Algorithm (IWOA), and an Elman neural network. First, text mining is used to extract and quantify online public opinion about Shanghai Stock Exchange (SSE) 180 constituent stocks, and the Boruta algorithm selects the important attributes to reduce the complexity of the attribute set. Then, the CEEMDAN algorithm adds a certain amount of white noise with specified variance to the attribute set, decomposing and de-noising the attribute series. Meanwhile, adaptive weights improve the global search and local exploitation abilities of the Whale Optimization Algorithm (WOA). Finally, the WOA iteratively optimizes the initial weights and thresholds of the Elman network. Results show that, compared with the Elman network alone, the model's mean absolute error (MAE) drops from 358.8120 to 113.0553, and compared with the original dataset without CEEMDAN, the mean absolute percentage error (MAPE) drops from 4.9423% to 1.44531%. The proposed model thus effectively improves prediction accuracy and provides an effective experimental method for predicting stock-market online public opinion.
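A minimal whale-optimization loop with an adaptive inertia weight on the best position follows; the weight schedule `w` is one common IWOA variant assumed here, not necessarily the paper's improvement, and in the model above the objective `f` would measure Elman network error as a function of its initial weights and thresholds:

```python
import math
import numpy as np

def iwoa_minimize(f, dim, bounds, n_whales=20, iters=100, seed=0):
    """Minimize f over [lo, hi]^dim with an adaptive-weight WOA sketch."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n_whales, dim))
    best = min(X, key=f).copy()
    for t in range(iters):
        a = 2 * (1 - t / iters)                   # 2 -> 0 over iterations
        w = 0.9 - 0.5 * t / iters                 # adaptive weight (assumed)
        for i in range(n_whales):
            r, l = rng.random(), rng.uniform(-1, 1)
            A, C = 2 * a * r - a, 2 * rng.random()
            if rng.random() < 0.5:                # encircling / random search
                ref = best if abs(A) < 1 else X[rng.integers(n_whales)]
                X[i] = w * ref - A * np.abs(C * ref - X[i])
            else:                                 # spiral bubble-net move
                D = np.abs(best - X[i])
                X[i] = D * math.exp(l) * math.cos(2 * math.pi * l) + w * best
            X[i] = np.clip(X[i], lo, hi)
            if f(X[i]) < f(best):
                best = X[i].copy()
    return best
```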

13.

Software Defect Prediction (SDP) is a crucial task in the software development process: it forecasts which modules are more prone to errors and faults before the testing phase begins, aiming to reduce development cost by focusing testing efforts on the predicted faulty modules. Although SDP helps ensure in-time delivery of a good-quality end product, class imbalance in the dataset is a major hindrance to it. This paper proposes a novel Neighbourhood-based Under-Sampling (N-US) algorithm to handle the class imbalance issue and demonstrates its effectiveness in attaining high accuracy when predicting defective modules. N-US under-samples the dataset to maximize the visibility of minority data points while restricting the excessive elimination of majority data points to avoid information loss. To assess its applicability, N-US is compared with three standard under-sampling techniques. Further, this study investigates the performance of N-US as a trusted ally for SDP classifiers. Extensive experiments are conducted using benchmark datasets from the NASA repository: CM1, JM1, KC1, KC2, and PC1. The proposed SDP classifier with the N-US technique is compared statistically with baseline models to assess the effectiveness of the N-US algorithm for SDP. It outperforms the other candidate SDP models, with the highest AUC score (95.6%), the maximum accuracy (96.9%), and the ROC curve closest to the top-left corner, showing the best predictive power at a 95% confidence level.
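One plausible reading of a neighbourhood-based under-sampler, sketched below as an assumption rather than the exact N-US rule: drop the majority modules whose k nearest neighbours are all majority (points far from the class boundary), so the rare defective modules gain visibility while boundary information is preserved:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbourhood_undersample(X, y, minority=1, k=5):
    """Keep all minority points and only boundary-adjacent majority points."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)            # idx[:, 0] is the point itself
    keep = []
    for i in range(len(y)):
        if y[i] == minority:
            keep.append(i)               # keep every minority module
        elif (y[idx[i, 1:]] == minority).any():
            keep.append(i)               # majority module near the boundary
    keep = np.array(keep)
    return X[keep], y[keep]
```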

14.
Research on weighted fuzzy association rules
1 Introduction. An association rule expresses a condition in which attribute-value pairs frequently occur together in a given dataset; the most common application is market-basket analysis of transaction databases from large supermarkets. Reference [1] proposed the Apriori algorithm for Boolean-attribute association rules to solve such problems. Quantitative associations have important applications in stock-market analysis, bank-deposit analysis, medical diagnosis, and many other areas. They describe relationships among quantitative attribute features and are expressed as quantitative association rules, such as "10% of married people aged 50-70 own at least two cars." Reference [2] first discussed quantitative association rules; its mining algorithm partitions quantitative attributes into multiple intervals, but such partitioning suffers from overly hard interval boundaries.
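The fuzzy remedy for hard interval boundaries is membership degrees: a value near an interval edge belongs partially to two adjacent intervals instead of falling entirely into one. A minimal sketch with triangular membership functions and illustrative interval parameters:

```python
def triangular_membership(x, a, b, c):
    """Degree to which value x belongs to the fuzzy interval (a, b, c):
    membership rises from a to the peak at b, then falls to c.
    """
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Age 52 is mostly "middle-aged" and partially "old":
print(triangular_membership(52, 30, 50, 70))  # 0.9
print(triangular_membership(52, 50, 70, 90))  # 0.1
```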

15.
Electronic Commerce (EC) has opened a new channel for instant on-line shopping. However, so many products are available from the great number of virtual stores on the Internet that shoppers struggle to choose among them, and on-line one-to-one marketing therefore becomes a great assistance to Internet shoppers. One of the most important marketing resources is the record of prior daily transactions in the database: this large amount of data provides not only statistics but also a source of experience and knowledge. It is quite natural for marketing managers to mine the daily transactions and treat shoppers the way they prefer. However, data mining over a significant number of transaction records requires efficient tools. Automatic or semi-automatic exploration and analysis of a large set of data items in a database can discover significant patterns and rules underlying the database, and this knowledge can be built into the on-line marketing system to promote Internet sales.

The purpose of this paper is to develop an association rule mining procedure over a database to support on-line recommendation. Through customer and product fragmentation, product recommendation based on the hidden habits of customers recorded in the database becomes very meaningful. The proposed data mining procedure consists of two essential modules. One is a clustering module based on a neural network, the Self-Organizing Map (SOM), which performs affinity grouping tasks on a large number of database records. The other is a rule extraction module employing rough set theory, which extracts association rules for each homogeneous cluster of data records as well as the relationships between different clusters. The implemented system was applied to a sample of sales records from a database for illustration.
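The SOM clustering module can be sketched with the third-party `minisom` package (assumed available); each map cell acts as an affinity cluster over which rules would then be extracted, though the rough-set stage is not sketched, and the grid size and iteration count are illustrative:

```python
import numpy as np
from minisom import MiniSom

def cluster_transactions(vectors, grid=5, iters=2000, seed=0):
    """Group customer transaction vectors on a self-organizing map.

    Returns, for each record, the winning map cell, which serves as
    its cluster id for the downstream rule-extraction module.
    """
    data = np.asarray(vectors, dtype=float)
    som = MiniSom(grid, grid, data.shape[1], sigma=1.0,
                  learning_rate=0.5, random_seed=seed)
    som.random_weights_init(data)
    som.train_random(data, iters)
    return [som.winner(v) for v in data]
```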


16.
On-line discussion forums constitute communities of people learning from each other; they not only inform students about their peers' doubts and problems but can also inform instructors about their students' knowledge of the course contents. In fact, there is nowadays increasing interest in the use of discussion forums as an indicator of student performance. This paper therefore proposes the use of different data mining approaches for improving the prediction of students' final performance from participation indicators in quantitative, qualitative, and social network forum data. Our objective is to determine how the selection of instances and attributes, the use of different classification algorithms, and the date when data is gathered affect the accuracy and comprehensibility of the prediction. A new Moodle module for gathering forum indicators was developed, and different executions were carried out using real data from 114 university students in a first-year computer science course. A representative set of traditional classification algorithms was used and compared against classification-via-clustering algorithms for predicting whether students would pass or fail the course on the basis of their forum usage. The results indicate the suitability of performing both a final prediction at the end of the course and an early prediction before its end; of applying clustering plus class association rule mining instead of traditional classification to obtain highly interpretable student performance models; and of using a subset of attributes instead of all available attributes, and only students' messages with content related to the subject of the course rather than all forum messages, to improve classification accuracy.

17.
Attribute reduction is an important data-mining method. To achieve better attribute reduction on hybrid (mixed-type) information systems, we propose a heuristic attribute-reduction algorithm based on a neighbourhood combined measure. Neighbourhood dependency is a common way to construct attribute reduction for hybrid information systems; from the perspective of granular computing, we introduce neighbourhood knowledge granularity to evaluate the granulation ability of attributes in hybrid information systems. Combining neighbourhood dependency with neighbourhood knowledge granularity, we define a neighbourhood combined measure for hybrid information systems and use it as the heuristic function of an attribute-reduction algorithm. Experimental analysis shows that the algorithm achieves higher reduction performance than other related attribute-reduction algorithms for hybrid information systems.
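The dependency half of the combined measure can be sketched under the usual neighbourhood rough-set definition: the fraction of samples whose delta-neighbourhood is pure, i.e. all neighbours share the sample's decision class. The knowledge-granularity term and the heuristic search loop are omitted, and `delta` is an illustrative radius:

```python
import numpy as np

def neighbourhood_dependency(X, y, delta=0.2):
    """Dependency of decision y on attribute subset X (columns of X).

    A sample belongs to the lower approximation when every sample in
    its delta-neighbourhood carries the same class label.
    """
    n = len(y)
    lower = 0
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        neigh = dist <= delta              # delta-neighbourhood of sample i
        if np.all(y[neigh] == y[i]):
            lower += 1                     # i is in the lower approximation
    return lower / n
```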

18.
Software metrics-based quality estimation models can be effective tools for identifying which modules are likely to be fault-prone or not fault-prone. The use of such models prior to system deployment can considerably reduce the likelihood of faults discovered during operations, hence improving system reliability. A software quality classification model is calibrated using metrics from a past release or a similar project and is then applied to modules currently under development, yielding a timely prediction of which modules are likely to have faults. However, software quality classification models used in practice may not provide a useful balance between the two misclassification rates, especially when there are very few faulty modules in the system being modeled. This paper presents, in the context of case-based reasoning, two practical classification rules that allow appropriate emphasis on each type of misclassification as per project requirements. The suggested techniques are especially useful for high-assurance systems where faulty modules are rare. The proposed generalized classification methods emphasize the costs of misclassification and the unbalanced distribution of faulty program modules. We illustrate the proposed techniques with a case study consisting of software measurements and fault data collected over multiple releases of a large-scale legacy telecommunication system, and we also present a brief relative comparison of the two techniques. The level of classification accuracy and model robustness observed for the case study would be beneficial in achieving high software reliability in subsequent system releases; similar observations are made in our empirical studies with other case studies.
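The core of any cost-emphasizing classification rule is an expected-cost comparison; a minimal sketch, with the 10:1 cost ratio chosen purely for illustration and the papers' specific generalized rules not reproduced:

```python
def classify_with_costs(p_fault, c_fp=1.0, c_fn=10.0):
    """Label a module fault-prone when the expected cost of missing a
    fault exceeds that of a false alarm: p * c_fn >= (1 - p) * c_fp.
    Raising c_fn shifts emphasis toward catching rare faulty modules.
    """
    return p_fault * c_fn >= (1 - p_fault) * c_fp

# With a 10:1 cost ratio, even a 10% fault probability triggers review:
print(classify_with_costs(0.10))  # True  (0.10 * 10 >= 0.90 * 1)
```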

19.
Many development organizations try to minimize faults in software as a means of improving customer satisfaction. Assuring high software quality often entails time-consuming and costly development processes. A software quality model based on software metrics can be used to guide enhancement efforts by predicting which modules are fault-prone. This paper presents statistical techniques to determine which predictions by a classification tree should be considered uncertain. We conducted a case study of a large legacy telecommunications system: one release was the basis for the training dataset, and the subsequent release was the basis for the evaluation dataset. We built a classification tree using the TREEDISC algorithm, which is based on χ2 tests of contingency tables. The model predicted whether or not a module was likely to have faults discovered by customers, based on software product, process, and execution metrics. We simulated practical use of the model by classifying the modules in the evaluation dataset. The model achieved useful accuracy, in spite of the very small proportion of fault-prone modules in the system. We assessed by statistical tests whether the classes assigned to the leaves were appropriate, and found sizable subsets of modules with uncertain classifications. Discovering which modules have uncertain classifications allows sophisticated enhancement strategies to resolve the uncertainties; moreover, TREEDISC is especially well suited to identifying uncertain classifications.
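One simple way to flag an uncertain leaf, sketched under the assumption of a two-sided binomial test against a 50/50 split (the paper's exact contingency-table tests may differ): if we cannot reject the hypothesis that the leaf's true fault-prone rate is 0.5, its majority-class label is not statistically trustworthy:

```python
from scipy.stats import binomtest

def leaf_is_uncertain(n_fault_prone, n_total, p_threshold=0.5, alpha=0.05):
    """Flag a tree leaf whose class counts are consistent with chance."""
    result = binomtest(n_fault_prone, n_total, p_threshold)
    return result.pvalue >= alpha   # cannot reject 50/50 -> uncertain

print(leaf_is_uncertain(6, 10))   # True: 6 of 10 could easily be chance
print(leaf_is_uncertain(45, 50))  # False: clearly fault-prone
```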

20.
This work studies how to improve software quality. Software is an intellectual product with many quality-measurement attributes; traditional neural networks cannot accurately extract the optimal attributes for measuring software quality, so prediction accuracy is low. To improve prediction accuracy, a genetic algorithm is introduced into the selection of software quality metrics. First, the genetic algorithm selects the optimal quality-measurement attributes; these attributes are then fed into a neural network for training to build a software quality prediction model. Simulation tests of the model show that the genetic neural network reduces the error rate of software quality prediction and improves its accuracy, which is innovative both in theory and in practice.
