Similar literature
 20 similar documents found (search time: 31 ms)
1.
In this paper, we report our experience on the use of phrases as basic features in the email classification problem. We performed extensive empirical evaluation using our large email collections and tested with three text classification algorithms, namely, a naive Bayes classifier and two k-NN classifiers using TF-IDF weighting and resemblance respectively. The investigation includes studies on the effect of phrase size, the size of local and global sampling, the neighbourhood size, and various methods to improve the classification accuracy. We determined suitable settings for various parameters of the classifiers and performed a comparison among the classifiers with their best settings. Our results show that no classifier dominates the others in terms of classification accuracy. Also, we made a number of observations on the special characteristics of emails. In particular, we observed that public emails are easier to classify than private ones.
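The resemblance-based k-NN classifier mentioned in entry 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a "phrase" is a word n-gram and that "resemblance" is the Jaccard ratio of two phrase sets. The function names and the toy emails are hypothetical.

```python
from collections import Counter


def phrases(text, size=2):
    """Return the set of word n-grams (assumed meaning of "phrases")."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a, b, size=2):
    """Jaccard ratio of the two phrase sets (assumed resemblance measure)."""
    pa, pb = phrases(a, size), phrases(b, size)
    if not pa and not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


def knn_classify(query, labelled, k=3, size=2):
    """Majority label among the k labelled emails that most resemble the query."""
    ranked = sorted(labelled,
                    key=lambda item: resemblance(query, item[0], size),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data, for illustration only.
emails = [
    ("meeting agenda for monday", "work"),
    ("project meeting agenda attached", "work"),
    ("cheap pills buy now", "spam"),
]
print(knn_classify("meeting agenda for tuesday", emails, k=1))
```

The phrase size and the neighbourhood size `k` are exactly the parameters the abstract says were tuned, so they are exposed as arguments here.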

2.
Many studies have tried to optimize parameters of case-based reasoning (CBR) systems. Among them, selection of appropriate features to measure similarity between the input and stored cases more precisely, and selection of appropriate instances to eliminate noise that distorts prediction, have been popular. However, these approaches have been applied independently, although their simultaneous optimization may improve the prediction performance synergistically. This study proposes a case-based reasoning system with a two-dimensional reduction technique. In this study, vertical and horizontal dimensions of the research data are reduced through our research model, a hybrid feature and instance selection process using genetic algorithms. We apply the proposed model to a case involving real-world customer classification which predicts customers’ buying behavior for a specific product using their demographic characteristics. Experimental results show that the proposed technique may improve the classification accuracy and outperform various optimized models of the typical CBR system.

3.
Electronic mail is a major revolution over traditional communication systems owing to its convenient, economical, fast, and easy-to-use nature. A major bottleneck in electronic communication is the enormous dissemination of unwanted, harmful emails known as spam. A major concern is the development of suitable filters that can adequately capture those emails and achieve a high performance rate. Machine learning (ML) researchers have developed many approaches to tackle this problem. Within machine learning, support vector machines (SVMs) have made a large contribution to the development of spam email filtering. Based on SVMs, different schemes have been proposed through text classification (TC) approaches. A crucial problem when using SVMs is the choice of kernel, as it directly affects the separation of emails in the feature space. This paper presents a thorough investigation of several distance-based kernels and specifies spam filtering behaviors using SVMs. Most kernels used in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose the use of various string kernels for spam filtering. We show how effectively string kernels suit the spam filtering problem. In addition, data preprocessing is a vital part of text classification, where the objective is to generate feature vectors usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in the filtering task. Furthermore, to cope with real-time scenarios, we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that the active online method using string kernels achieves higher precision and recall rates.

4.
An empirical study of predicting software faults with case-based reasoning   Total citations: 1, self-citations: 0, citations by others: 1
The resources allocated for software quality assurance and improvement have not increased with the ever-increasing need for better software quality. A targeted software quality inspection can detect faulty modules and reduce the number of faults occurring during operations. We present a software fault prediction modeling approach with case-based reasoning (CBR), a part of the computational intelligence field focusing on automated reasoning processes. A CBR system functions as a software fault prediction model by quantifying, for a module under development, the expected number of faults based on similar modules that were previously developed. Such a system is composed of a similarity function, the number of nearest neighbor cases used for fault prediction, and a solution algorithm. The selection of a particular similarity function and solution algorithm may affect the performance accuracy of a CBR-based software fault prediction system. This paper presents an empirical study investigating the effects of using three different similarity functions and two different solution algorithms on the prediction accuracy of our CBR system. The influence of varying the number of nearest neighbor cases on the performance accuracy is also explored. Moreover, the benefits of using metric-selection procedures for our CBR system are also evaluated. Case studies of a large legacy telecommunications system are used for our analysis. It is observed that the CBR system using the Mahalanobis distance similarity function and the inverse distance weighted solution algorithm yielded the best fault prediction. In addition, the CBR models have better performance than models based on multiple linear regression. Taghi M. Khoshgoftaar is a professor in the Department of Computer Science and Engineering, Florida Atlantic University and the Director of the Empirical Software Engineering Laboratory. 
His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, and statistical modeling. He has published more than 200 refereed papers in these areas. He has been a principal investigator and project leader in a number of projects with industry, government, and other research-sponsoring agencies. He is a member of the Association for Computing Machinery, the IEEE Computer Society, and IEEE Reliability Society. He served as the general chair of the 1999 International Symposium on Software Reliability Engineering (ISSRE’99), and the general chair of the 2001 International Conference on Engineering of Computer Based Systems. Also, he has served on technical program committees of various international conferences, symposia, and workshops. He has served as North American editor of the Software Quality Journal, and is on the editorial boards of the journals Empirical Software Engineering, Software Quality, and Fuzzy Systems. Naeem Seliya received the M.S. degree in Computer Science from Florida Atlantic University, Boca Raton, FL, USA, in 2001. He is currently a Ph.D. candidate in the Department of Computer Science and Engineering at Florida Atlantic University. His research interests include software engineering, computational intelligence, data mining, software measurement, software reliability and quality engineering, software architecture, computer data security, and network intrusion detection. He is a student member of the IEEE Computer Society and the Association for Computing Machinery.
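The best-performing configuration reported in entry 4 (Mahalanobis distance as the similarity function, inverse distance weighting as the solution algorithm) can be sketched roughly as below. This is a hedged illustration in plain Python, not the authors' system: the covariance is estimated from the case library itself, `eps` is an assumed guard against division by zero, and the module metrics and fault counts are made up.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(m)] for i in range(m)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting; raises if singular."""
    m = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(m)]
           for i, row in enumerate(matrix)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]


def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between two feature vectors."""
    diff = [a - b for a, b in zip(x, y)]
    m = len(diff)
    q = sum(diff[i] * cov_inv[i][j] * diff[j]
            for i in range(m) for j in range(m))
    return math.sqrt(max(q, 0.0))


def predict_faults(query, cases, faults, k=3, eps=1e-9):
    """Inverse-distance-weighted mean fault count of the k nearest cases."""
    cov_inv = invert(covariance(cases))
    ranked = sorted((mahalanobis(query, c, cov_inv), f)
                    for c, f in zip(cases, faults))[:k]
    weights = [1.0 / (d + eps) for d, _ in ranked]
    return sum(w * f for w, (_, f) in zip(weights, ranked)) / sum(weights)


# Hypothetical module metrics (e.g. size, complexity) and fault counts.
cases = [[10, 1], [20, 2], [30, 5], [40, 3], [50, 8]]
faults = [0, 1, 2, 3, 4]
print(predict_faults([30, 5], cases, faults, k=3))
```

A query that coincides with a stored case gets a prediction very close to that case's fault count, because its weight dominates the weighted mean.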

5.
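Han Min  Shen Lihua 《控制与决策》 (Control and Decision) 2011,26(4):637-640
Distance measurement is a key issue in case retrieval and directly affects retrieval accuracy. This paper studies distance measures and proposes a self-learning distance measure based on particle swarm optimization. Introducing this measure into case-based reasoning gives CBR a sound way to handle cases described by correlated attributes, which extends its range of application. Finally, simulation experiments on real-world data and UCI data were carried out for the CBR technique based on the new distance measure. The results show that, compared with other methods, the proposed method improves the accuracy of case retrieval.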

6.
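Research on the application of CBR technology in clinical diagnostic support   Total citations: 1, self-citations: 0, citations by others: 1
CBR is a technique that uses previous similar cases to understand and solve the current problem. This paper introduces the technical characteristics of CBR and studies its application in clinical diagnostic support. It focuses on three aspects: the organizational structure of the case base, the retrieval algorithm for similar cases, and the adjustment of symptom weights, and it gives a corresponding solution for each.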

7.
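A reasoning technique combining cases and rules   Total citations: 5, self-citations: 0, citations by others: 5
Over the years, machine learning researchers have proposed many hybrid architectures to improve learning performance. This paper presents a reasoning technique that combines case-based reasoning with rule-based reasoning, together with a case-base partitioning algorithm. The aim is to exploit the strengths of both kinds of reasoning and improve problem-solving efficiency. Some test results and related conclusions are given at the end.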

8.
Hui Li  Jie Sun 《Information Sciences》2009,179(1-2):89-108
Case-based reasoning (CBR) is an easily understandable concept. Business failure prediction (BFP) is a valuable tool that can help businesses take appropriate action when faced with the possibility of business failure. This study aims to improve the performance of a CBR system for BFP in terms of accuracy and reliability by constructing a new similarity measure, an area seldom researched in the domain of BFP. To fulfill this objective, we present a hybrid Gaussian CBR (GCBR) system and use it in BFP with empirical data in China. The heart of GCBR is a similarity measure using Gaussian indicators. Feature distances between a pair of cases on each feature are transformed into Gaussian indicators by Gaussian transformations. A combiner is used to generate case similarity on the basis of the Gaussian indicators. Consensus of nearest neighbors is used to generate forecasts on the basis of case similarity. The new hybrid CBR system was empirically tested with data collected from the Shanghai Stock Exchange and Shenzhen Stock Exchange in China. We statistically validated our results by comparing them with multiple discriminant analysis, logistic regression, and two classical CBR systems. The results indicated that GCBR produces superior performance in short-term BFP of Chinese listed companies in terms of both predictive accuracy and coefficient of variation.
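The three steps described in entry 8 (Gaussian transformation of per-feature distances, a combiner, and nearest-neighbor consensus) might look like the sketch below. The abstract does not give the formulas, so everything specific here is an assumption: the indicator is taken as exp(-d^2 / (2 sigma^2)), the combiner as a weighted mean, and the consensus as a majority vote. The financial ratios are invented.

```python
import math
from collections import Counter


def gaussian_indicator(distance, sigma=1.0):
    """Map a feature distance to (0, 1]; a distance of 0 gives 1 (assumed form)."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))


def case_similarity(a, b, sigmas, weights=None):
    """Combiner: weighted mean of the per-feature Gaussian indicators (assumed)."""
    weights = weights or [1.0] * len(a)
    indicators = [gaussian_indicator(abs(x - y), s)
                  for x, y, s in zip(a, b, sigmas)]
    return sum(w * g for w, g in zip(weights, indicators)) / sum(weights)


def forecast(query, cases, labels, sigmas, k=3):
    """Consensus (majority vote) of the k most similar stored cases."""
    ranked = sorted(zip(cases, labels),
                    key=lambda cl: case_similarity(query, cl[0], sigmas),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Hypothetical normalised financial ratios and outcomes.
cases = [[0.10, 0.20], [0.15, 0.25], [0.90, 0.80], [0.85, 0.95], [0.20, 0.10]]
labels = ["healthy", "healthy", "failed", "failed", "healthy"]
print(forecast([0.88, 0.90], cases, labels, sigmas=[0.2, 0.2], k=3))
```

One reason a Gaussian indicator is attractive is that it saturates: very distant feature values all contribute roughly zero similarity, so a single outlying feature cannot dominate the combined score.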

9.
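A comparative study of attribute weight allocation models for case-based reasoning   Total citations: 2, self-citations: 0, citations by others: 2
Yan Aijun  Qian Limin  Wang Pu 《自动化学报》 (Acta Automatica Sinica) 2014,40(9):1896-1902
The values assigned to attribute weights in a case-based reasoning system determine the similarity between cases and therefore significantly affect whether the reasoning result is correct. Building on attribute-weighted K-nearest-neighbor retrieval of similar cases, this paper discusses the mechanism of allocating attribute weights by the water-filling principle. By establishing a rationality index for weight allocation and constructing a Lagrangian function to optimize the weights, it obtains a convergent water-filling allocation algorithm. Five-fold cross-validation pattern classification experiments compare equal weight allocation, the water-filling allocation algorithm, and genetic-algorithm allocation. The case-based reasoning classification results show that classification performance improves effectively once the water-filling allocation algorithm is introduced.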

10.
Sensor signal fusion is becoming increasingly important in many areas, including medical diagnosis and classification. Today, clinicians and experts often diagnose stress, sleepiness, and tiredness on the basis of information collected from several physiological sensor signals. Because sensor measurements vary widely between individuals, systems with a single sensor can easily be vulnerable to uncertain noise and interference in such domains; multiple sensors can provide more robust and reliable decisions. Therefore, this paper presents a classification approach, Multivariate Multiscale Entropy Analysis–Case-Based Reasoning (MMSE–CBR), that classifies physiological parameters of wheel loader operators by combining the CBR approach with a data-level fusion method named Multivariate Multiscale Entropy (MMSE). The MMSE algorithm supports complexity analysis of multivariate biological recordings by aggregating several sensor measurements, e.g., Inter-beat Interval (IBI) and Heart Rate (HR) from the Electrocardiogram (ECG), Finger Temperature (FT), Skin Conductance (SC), and Respiration Rate (RR). Here, MMSE is applied to extract features and formulate a case by fusing a number of physiological signals, and the CBR approach is applied to classify the cases by retrieving the most similar cases from the case library. Finally, the proposed MMSE–CBR approach has been evaluated with data from professional drivers at Volvo Construction Equipment, Sweden. The results demonstrate that the proposed system, which fuses information at the data level, classifies ‘stressed’ and ‘healthy’ subjects 83.33% correctly compared with an expert’s classification. Furthermore, with another data set, the achieved accuracy (83.3%) indicates that it can also classify two different conditions, ‘adapt’ (training) and ‘sharp’ (real-life driving), for the wheel loader operators. Thus, the new MMSE–CBR approach could support the classification of operators and may be of interest to researchers developing systems based on information collected from different sensor sources.

11.
We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish compression schemes that are invariant to permutation and repetition from those that are not, which leads to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression, we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients. Editor: Shai Ben-David

12.
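Case-based reasoning based on inductive techniques and its application   Total citations: 2, self-citations: 0, citations by others: 2
This paper first examines several techniques that can be combined with case-based reasoning. It then focuses on an integration of case-based reasoning and inductive techniques that exploits the respective strengths of both to improve problem-solving ability. The paper proposes a case-based reasoning classification algorithm based on inductive techniques, and experiments show that the algorithm achieves good classification accuracy.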

13.
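Research on an improved case-based reasoning classification method   Total citations: 1, self-citations: 0, citations by others: 1
Zhang Chunxiao  Yan Aijun  Wang Pu 《自动化学报》 (Acta Automatica Sinica) 2014,40(9):2015-2021
The weight allocation of feature attributes and the case retrieval strategy significantly affect the accuracy of case-based reasoning (CBR) classification. This paper proposes an improved CBR classification method that combines genetic algorithms, introspective learning, and group decision-making. First, a genetic algorithm produces multiple sets of attribute weights, and each set is then adjusted iteratively according to the principle of introspective learning. Next, a case group retrieval strategy yields a group-decision classification result that satisfies the majority principle. Finally, comparative experiments on typical classification datasets show that the method further improves the accuracy of CBR classification. This indicates that introspective learning can ensure reasonable weight allocation and that the case group retrieval strategy makes full use of the latent information in the case base, which significantly enhances the learning ability of CBR.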

14.
The last two decades have witnessed extensive research on multi-task learning algorithms in diverse domains such as bioinformatics, text mining, natural language processing as well as image and video content analysis. However, all existing multi-task learning methods require either domain-specific knowledge to extract features or a careful setting of many input parameters. There are many disadvantages associated with prior knowledge requirements for feature extraction or parameter-laden approaches. One of the most obvious problems is that we may find a wrong or non-existent pattern because of poorly extracted features or incorrectly set parameters. In this work, we propose a feature-free and parameter-light multi-task clustering framework to overcome these disadvantages. Our proposal is motivated by the recent successes of Kolmogorov-based methods on various applications. However, such methods are only defined for single-task problems because they lack a mechanism to share knowledge between different tasks. To address this problem, we create a novel dictionary-based compression dissimilarity measure that allows us to share knowledge across different tasks effectively. Experimental results with extensive comparisons demonstrate the generality and the effectiveness of our proposal.
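To give a feel for the Kolmogorov-based, feature-free methods that entry 14 builds on, here is a sketch of the well-known normalized compression distance (NCD), computed with Python's standard `zlib` module. This is explicitly not the dictionary-based measure the paper proposes, which the abstract does not specify. The sample byte strings are arbitrary.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a crude proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Arbitrary example inputs.
text = b"the quick brown fox jumps over the lazy dog " * 20
noise = bytes((i * 7919 + 13) % 251 for i in range(880))

print(ncd(text, text))   # small: the second copy compresses almost for free
print(ncd(text, noise))  # larger: the two inputs share little structure
```

The measure needs no feature extraction and has no tunable parameters beyond the choice of compressor, which is the "feature-free and parameter-light" property the abstract emphasises.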

15.
Stock selection is an important decision-making problem. Trading strategies and rules based on fundamental and technical analysis can be used in the decision-making process. In this paper, we propose an intelligent stock selection method based on case-based reasoning (CBR). This technique uses fundamental and technical indicators to identify the winning stocks around earnings announcements. The CBR method is compared with other artificial intelligence techniques such as the multilayer perceptron (MLP), decision trees (QUEST, Classification and Regression Trees, C5), generalized rule induction (GRI), and logistic regression. We show that CBR performs better than the other techniques in terms of classification accuracy, average return, Sharpe ratio, and ideal profit.

16.
In case-based reasoning (CBR) classification systems, the similarity metrics play a key role and directly affect the system's performance. Based on our previous work on learning pseudo metrics (LPM), we propose a case-based reasoning method for pattern classification, where the widely used Euclidean distance is replaced by the LPM to measure the closeness between the target case and each source case. Cases of the same type as the target case can be retrieved, and the category of the target case can be determined by the majority reuse principle. Experimental results on several benchmark datasets and a fault diagnosis of the Tennessee-Eastman (TE) process demonstrate that the proposed reasoning techniques can effectively improve the classification accuracy, and that the LPM-based retrieval method can substantially improve the quality and learning ability of CBR classifiers.

17.
Case-based reasoning (CBR) is a problem-solving technique that uses previous cases to solve new, unseen, and different problems. Although a larger number of cases in memory can improve the coverage of the problem space, retrieval efficiency degrades if the case base grows to an unacceptable size. In CBR systems, the tradeoff between the number of cases stored in the case base and the retrieval efficiency is a critical issue. This paper addresses the problem of case-base maintenance by developing a new technique, the association-based case reduction technique (ACRT), which reduces the size of the case base to enhance efficiency while maintaining or even improving the accuracy of the CBR. Experiments on 12 UCI datasets and an actual case from a hospital in Taiwan have shown superior generalization accuracy for CBR with ACRT (CBR-ACRT) as well as greater solving efficiency.

18.
Dynamic time warping (DTW) has proven itself to be an exceptionally strong distance measure for time series. DTW in combination with one-nearest neighbor, one of the simplest machine learning methods, has been difficult to convincingly outperform on the time series classification task. In this paper, we present a simple technique for time series classification that exploits DTW’s strength on this task. But instead of directly using DTW as a distance measure to find nearest neighbors, the technique uses DTW to create new features which are then given to a standard machine learning method. We experimentally show that our technique improves over one-nearest neighbor DTW on 31 out of 47 UCR time series benchmark datasets. In addition, this method can be easily extended to be used in combination with other methods. In particular, we show that when combined with the symbolic aggregate approximation (SAX) method, it improves over it on 37 out of 47 UCR datasets. Thus the proposed method also provides a mechanism to combine distance-based methods like DTW with feature-based methods like SAX. We also show that combining the proposed classifiers through ensembles further improves the performance on time series classification.
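The idea in entry 18 of turning DTW distances into features can be sketched as follows. The `dtw` function is the textbook dynamic-programming recurrence with an absolute-difference cost and no warping window. Treating "DTW distance to each reference series" as the feature vector is an assumption about what the abstract means, and the reference series are made up.

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning the processed
    # prefix of `a` with the first j points of `b`.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def dtw_features(series, references):
    """Feature vector: DTW distance from `series` to each reference series."""
    return [dtw(series, ref) for ref in references]


# Hypothetical reference series, e.g. a few training examples.
references = [[1, 2, 3], [3, 2, 1], [0, 0, 0]]
print(dtw_features([1, 2, 2, 3], references))
```

The resulting vectors can be passed to any standard classifier, which is how a distance-based method gets combined with a feature-based one.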

19.
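Yan Aijun  Wei Zhiyuan 《计算机应用》 (Journal of Computer Applications) 2021,41(4):1071-1077
Because feature weight allocation and case-base maintenance strongly affect the performance of a case-based reasoning (CBR) classifier, a CBR algorithm model named AGECBR (Ant Lion and Expectation Maximization of Gaussian Mixture Model Case-Based Reasoning) is proposed. It allocates weights with the ant lion optimizer (ALO) and maintains the case base with the expectation-maximization algorithm of the Gaussian mixture model (GMMEM). First, the ant lion algorithm allocates the feature weights; in this process, CBR classification accuracy serves as the fitness function for the iterative search over feature weights, which yields an optimized weight allocation. Then, the GMM expectation-maximization algorithm clusters the cases in the case base, and noisy and redundant cases are deleted, which maintains the case base. In experiments on UCI benchmark datasets, AGECBR improved average classification accuracy by 3.83 to 5.44 percentage points over classification algorithms such as back propagation (BP) and k-nearest neighbor (kNN). The results show that AGECBR effectively improves CBR classification accuracy.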

20.
A suffix tree approach to anti-spam email filtering   Total citations: 1, self-citations: 0, citations by others: 1
We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the classification of documents in other domains. Editor: Tom Fawcett
