Similar literature
 19 similar documents found (search time: 500 ms)
1.
王波  刘丰年 《软件》2010,31(10):44-48
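To address the low efficiency of traditional recognition techniques in license plate character recognition, this paper proposes a fast license plate recognition technique based on an efficient rough set attribute reduction algorithm. The method first builds a decision table from the feature vectors of the training samples and applies a second discretization pass to it, then uses rough set theory to perform efficient attribute reduction on the decision table, and finally extracts decision rules from the reduced table and matches rules in order of rule confidence. Experiments show that the method effectively compresses the number of image features, simplifies the rule matching algorithm, and improves both the character recognition rate and the recognition speed, achieving good results in license plate character recognition.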
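
The attribute reduction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify; it is a generic greedy heuristic that keeps adding attributes until the positive region matches that of the full attribute set. The toy decision table and the attribute names (`holes`, `strokes`, `wide`, `label`) are hypothetical.

```python
from collections import defaultdict

def positive_region_size(rows, attrs, decision):
    """Count rows whose equivalence class under `attrs` maps to one decision value."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[tuple(row[a] for a in attrs)].append(row[decision])
    return sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)

def greedy_reduct(rows, attrs, decision):
    """Greedily add the attribute that enlarges the positive region the most."""
    target = positive_region_size(rows, attrs, decision)
    reduct = []
    while positive_region_size(rows, reduct, decision) < target:
        remaining = [a for a in attrs if a not in reduct]
        best = max(remaining,
                   key=lambda a: positive_region_size(rows, reduct + [a], decision))
        reduct.append(best)
    return reduct

# Hypothetical, already discretized character features (not data from the paper).
rows = [
    {"holes": 0, "strokes": 1, "wide": 0, "label": "1"},
    {"holes": 1, "strokes": 1, "wide": 1, "label": "0"},
    {"holes": 2, "strokes": 1, "wide": 1, "label": "8"},
    {"holes": 0, "strokes": 2, "wide": 1, "label": "7"},
    {"holes": 0, "strokes": 1, "wide": 0, "label": "1"},
]
attrs = ["holes", "strokes", "wide"]
reduct = greedy_reduct(rows, attrs, "label")
print(reduct)
```

On this toy table two of the three attributes already determine the label, so the third can be dropped before rules are extracted. A greedy pass like this is not guaranteed to find a minimal reduct in general.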

2.
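In text clustering, the high dimensionality of text feature vectors makes it very difficult to estimate the statistical characteristics of the samples, so effective dimensionality reduction is necessary. ISOMAP is a recently introduced nonlinear dimensionality reduction method that can effectively reduce the dimensionality of the text feature space. It improves the distance measure between sample vectors by replacing the traditional Euclidean distance with the geodesic distance, and maps high-dimensional text feature data onto a low-dimensional, visualizable space of 2 to 3 dimensions. This reduces the data dimensionality, enables visualization of text features, and to some extent addresses the problem of choosing the number of clusters. Finally, an example verifies the feasibility of the method.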
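
The difference between geodesic and Euclidean distance can be sketched as follows. This covers only the first stage of ISOMAP (a k-nearest-neighbour graph followed by all-pairs shortest paths); the final embedding step is omitted. The function name, the choice of `k`, and the semicircle data are illustrative assumptions, and the sketch assumes distinct points and a connected neighbourhood graph.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances via a k-NN graph and Floyd-Warshall."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        # Skip index 0 of the sorted list: it is the point itself (distance 0).
        neighbours = sorted(range(n), key=lambda j: d[i][j])[1:k + 1]
        for j in neighbours:
            g[i][j] = g[j][i] = d[i][j]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Eleven points on a unit semicircle: the endpoints are 2 apart in a straight line,
# but roughly pi apart when travelling along the curve.
points = [(math.cos(math.pi * i / 10), math.sin(math.pi * i / 10)) for i in range(11)]
geo = geodesic_distances(points, k=2)
print(math.dist(points[0], points[10]), geo[0][10])
```

The graph distance between the two endpoints follows the arc rather than cutting across it, which is the property ISOMAP relies on when it unfolds a curved manifold.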

3.
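When SVM is applied to text classification, selecting text features with the traditional TFIDF algorithm produces high-dimensional feature vectors, which impairs the efficiency and accuracy of the SVM and degrades its performance. To solve this problem, an ontology-based feature term reduction method is proposed. The method uses an ontology to find redundant feature terms in the feature vector that are related by synonymy, composition (part-whole), or hypernymy/hyponymy, and then merges them to reduce the dimensionality of the feature vector. Experimental results show that reducing feature vectors with an ontology improves the performance of the SVM classifier.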
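
A minimal sketch of the merging idea, under stated assumptions: the dictionary `CANONICAL` is a hypothetical stand-in for an ontology lookup, the documents are made up, and the TF-IDF variant (relative term frequency times `log(N / df)`) is one common textbook form, not necessarily the one used in the paper.

```python
import math
from collections import Counter

# Hypothetical stand-in for an ontology: maps a term to its canonical concept.
CANONICAL = {"automobile": "car", "sedan": "car"}

def merge_terms(tokens, canonical):
    """Replace each token by its canonical concept, if the mapping has one."""
    return [canonical.get(t, t) for t in tokens]

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

docs = [["car", "engine"], ["automobile", "engine", "road"], ["sedan", "road"]]
merged = [merge_terms(d, CANONICAL) for d in docs]
vocab_before = {t for d in docs for t in d}
vocab_after = {t for d in merged for t in d}
weights = tfidf(merged)
print(len(vocab_before), len(vocab_after))
```

Merging shrinks the vocabulary, and therefore the feature vector, before any weighting is computed. In this toy corpus the merged concept then appears in every document, so its IDF factor becomes zero.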

4.
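To improve Chinese text classification, a rule matching method based on rough set theory is proposed. During text feature extraction, the CHI statistic is suitably improved, and the weights of feature terms are scaled and discretized. Attribute reduction and rule extraction under rough set theory are implemented with a discernibility matrix, and a rule pre-check method is used to optimize the decision parameters of rule matching. Experimental results show that the improved rule matching method achieves higher classification accuracy and also performs well when training data is scarce.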
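
For orientation, the standard (unmodified) CHI statistic for one term and one category can be computed from a 2x2 document count table. The abstract does not say how the authors improved it, so this sketch shows only the textbook form; the argument names are illustrative.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so the statistic is undefined
    return n * (a * d - b * c) ** 2 / denom

print(chi_square(40, 10, 10, 40))  # term concentrated in the category
print(chi_square(25, 25, 25, 25))  # term independent of the category
```

Terms with a high score are strongly associated with the category (positively or negatively) and are the usual candidates to keep during feature selection.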

5.
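Domain texts are structurally complex, highly similar, and dynamically changing, and they contain mixed data in which continuous and discrete attributes coexist. This limits, to some extent, the efficiency with which knowledge discovery methods mine text rules. To address the problem, this paper proposes a text rule mining method based on GMM and rough sets. The method first constructs an information table according to the attribute types of the target data. It then uses the Gaussian Mixture Model (GMM) clustering algorithm to cluster the continuous data, discretizes the data and reduces states accordingly, and generates a decision table. Finally, it uses rough set theory to perform attribute reduction on the decision table and extracts decision rules from the reduced table. Experimental results show that, compared with traditional methods, the proposed method has higher extraction accuracy and stronger attribute reduction ability, with an average information extraction accuracy and F1 score of 95.0% and 95.7%, respectively.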

6.
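Following the matching idea by which decision rules are used in practice, a decision table reduction method based on fuzzy clustering is proposed for consistent decision tables with numerical values. While preserving the consistency of the decision table, the method compresses objects that share the same decision by adjusting the degree of redundancy, thereby reducing the decision table. Verification with a direct classification method based on computing correlated rules shows that the method is effective and feasible.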

7.
Unknown virus detection based on the Win32 API   Total citations: 3 (self-citations: 1, citations by others: 2)
陈亮  郑宁  郭艳华  徐明  胡永涛 《计算机应用》2008,28(11):2829-2831
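A virus detection method based on behavioral feature vectors is proposed. Each dimension of the feature vector represents one malicious behavior event, and each event is represented by the corresponding Win32 application programming interface (API) calls and their parameters. An automated behavior tracing system (Argus) is implemented to extract the behavioral features. In the experiments, the sample data are analyzed and mutual information is used to perform attribute reduction on the feature vectors, reducing the feature dimensionality. The results show that the reduced model still detects virus programs with more than one behavior event well.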
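
Mutual information between one binary behavior feature and the class label can be estimated from counts, as in the sketch below. The sample vectors are invented for illustration, and the abstract does not state the threshold or ranking rule the authors used, so none is shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))) simplifies to (c/n) * log2(c*n / (cx*cy)).
    return sum(c / n * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

labels       = [0, 0, 1, 1]   # 1 = malicious (hypothetical samples)
informative  = [0, 0, 1, 1]   # event fires exactly on the malicious samples
uninformative = [0, 1, 0, 1]  # event fires independently of the label

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

Features whose score is close to zero carry little information about the label and are the natural candidates to drop when reducing the feature vector.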

8.
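Approximate decision rules and rule matching methods in rough set classification algorithms   Total citations: 1 (self-citations: 0, citations by others: 1)
When a rough set classification algorithm applies standard decision rules to classify new objects, the rules often fail to match a new object completely. Approximate decision rules and partial matching methods are therefore commonly used to increase the likelihood that decision rules match new objects. After surveying and comparing two algorithms for generating approximate decision rules, this paper uses a text classification system as an example to propose a combined, more effective approximate decision rule generation algorithm. It also introduces several general rule matching methods and proposes a series of practical formulas for complete and partial matching. Experiments show that the proposed generation algorithm and matching formulas effectively improve both the likelihood and the accuracy of matching decision rules to new objects.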

9.
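A face recognition algorithm fusing Log-Gabor wavelets and supervised locality preserving projection   Total citations: 3 (self-citations: 0, citations by others: 3)
Manifold learning is an unsupervised learning approach whose discriminative power is weaker than that of traditional dimensionality reduction algorithms, and manifold learning algorithms cannot effectively remove redundant information in images such as higher-order correlations. To address these two problems, a face recognition algorithm fusing Log-Gabor wavelets and supervised locality preserving projection (SLPP) is proposed. First, Log-Gabor wavelets are used to filter the normalized face image at multiple orientations and resolutions, and the corresponding Log-Gabor image feature vectors are extracted. Then the SLPP algorithm reduces the dimensionality of the Log-Gabor feature vectors to obtain low-dimensional discriminative features. Finally, a nearest neighbor classifier performs the classification. The algorithm combines the excellent representation ability of Log-Gabor features for face images with the nonlinear dimensionality reduction ability of SLPP, and is robust to changes in illumination and expression. Simulation experiments on the Yale and PIE face databases demonstrate the effectiveness of the algorithm.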

10.
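For inconsistent decision information systems, an algorithm is proposed that obtains decision rules using a rule discernibility matrix. Reducing an inconsistent decision information system may produce new conflicting rules. The algorithm selects reductions that leave the pairs of conflicting rules unchanged and generates rules from them, and on this basis merges the inconsistent rules to produce a correct rule set.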

11.
Applying EuroWordNet to Cross-Language Text Retrieval   Total citations: 1 (self-citations: 0, citations by others: 1)
We discuss ways in which EuroWordNet (EWN) can be used in multilingual information retrieval activities, focusing on two approaches to Cross-Language Text Retrieval that use the EWN database as a large-scale multilingual semantic resource. The first approach indexes documents and queries in terms of the EuroWordNet Inter-Lingual-Index, thus turning term weighting and query/document matching into language-independent tasks. The second describes how the information in the EWN database could be integrated with a corpus-based technique, thus allowing retrieval of domain-specific terms that may not be present in our multilingual database. Our objective is to show the potential of EuroWordNet as a promising alternative to existing approaches to Cross-Language Text Retrieval.

12.
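Introducing NLP techniques into information retrieval is a major trend in the development of the field. This paper adds part-of-speech tagging, one of the more mature NLP techniques, to information retrieval and uses the large-scale TREC data sets to try to determine how part-of-speech tagging affects retrieval system performance. The authors ran retrieval experiments on the SMART retrieval system with different tag sets and different index term weights. The experiments show that adding part-of-speech information may improve retrieval for certain specific Topics and Documents, but that its influence is weaker than that of the choice of index term weights. The effect of part-of-speech tagging on retrieval performance depends on the specific words used in the Topic and Document, and general patterns remain to be studied.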

13.
Support Vector Machines (SVM) are applied to Chinese official document classification in a One-against-All (OAA) multi-class scheme. Several data retrieval techniques, including sentence segmentation, term weighting, and feature extraction, are used in preprocessing. We observe that most documents whose contents are indistinguishable yield poor classification results. The traditional solution is to add misclassified documents to the training set in order to adjust the classification rules. In this paper, indistinguishable documents are observed to be informative for strengthening prediction performance, since their labels are predicted by the current model with low confidence. A general approach is proposed that uses SVM decision values to identify indistinguishable documents. Based on verified classification results and the distinguishability of documents, four learning strategies that select certain documents for the training sets are proposed to improve classification performance. Experiments show that indistinguishable documents can be identified with high probability and are informative for the learning strategies. Furthermore, LMID, which adds both misclassified and indistinguishable documents to the training sets, is the most effective learning strategy for SVM classification of large sets of Chinese official documents in terms of computing efficiency and classification accuracy.

14.
Wang Tao  Cai Yi  Leung Ho-fung  Lau Raymond Y. K.  Xie Haoran  Li Qing 《Knowledge and Information Systems》2021,63(9):2313-2346

In text categorization, Vector Space Model (VSM) has been widely used for representing documents, in which a document is represented by a vector of terms. Since different terms contribute to a document’s semantics in various degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across different text categorization tasks, while the mechanism underlying variability in a scheme’s performance remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of a term’s occurrences across all categories in a corpus. In this paper, we first systematically examine pros and cons of existing term weighting schemes in text categorization and explore the reasons why some schemes with sound theoretical bases, such as chi-square test and information gain, perform poorly in empirical evaluations. By measuring the concentration that a term distributes across all categories in a corpus, we then propose a series of entropy-based term weighting schemes to measure the distinguishing power of a term in text categorization. Through extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform the state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
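
How concentrated a term is across categories can be measured with a normalized entropy, as in the sketch below. This is only an illustration of the general idea; the function name and the `1 - H / log(k)` normalization are assumptions, and the paper's actual weighting schemes are not given in the abstract.

```python
import math

def concentration(counts):
    """Return 1 - H/log(k) for a term's occurrence counts over k categories.

    1.0 means the term occurs in a single category; 0.0 means it is spread evenly.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = -sum(c / total * math.log(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log(k)

print(concentration([10, 0, 0]))  # term confined to one category
print(concentration([5, 5, 5]))   # term spread evenly over all categories
print(concentration([8, 1, 1]))   # somewhere in between
```

Unlike a weight computed against a single category, this score looks at the term's distribution over all categories at once, which is the global view the abstract argues for.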


15.
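To address the shortcoming that existing vector space models ignore semantic relations between terms when representing documents, a new document vector representation method based on association rules is proposed. Term co-occurrence semantics are obtained by analyzing frequent co-occurrence relations between terms in the generalized vector space model. Association rules are then used to analyze the association correlation between terms and to mine the latent associative semantic relations between terms in a document. Documents are represented by a linear weighting of the co-occurrence semantics and the associative semantics. Experimental results show that, compared with the BOW and GVSM models, document clustering with the association rule based document vector representation is more accurate.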

16.
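This paper analyzes the current state of Web information retrieval technology and points out that the root cause of low retrieval effectiveness lies in the ranking functions and index term weighting techniques used by search engines. It introduces the traditional ranking functions and index term weighting techniques of information retrieval. It then analyzes the characteristics of Web documents, noting that their main form, the HTML document, is a structured document whose structure is explicitly defined by tags, and that different document structures contribute differently to retrieval performance. Results from researchers in China and abroad are compared. Finally, the paper discusses directions for the development of ranking functions and index term weighting techniques in Web information retrieval.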

17.
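A rough set based association rule mining method   Total citations: 1 (self-citations: 0, citations by others: 1)
Rough sets are studied and an association rule mining method based on rough set theory is proposed. The method first performs attribute reduction with a rough set feature attribute reduction algorithm and then, having built the reduced decision table, applies an improved Apriori algorithm to mine association rules. Its advantages are that it eliminates unimportant attributes, reduces the number of attributes and candidate itemsets, and needs only a single scan of the decision table to produce decision rules. An application example and an analysis of experimental results show that the method is effective and fast.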

18.
This paper proposes three feature selection algorithms with a feature weight scheme and dynamic dimension reduction for the text document clustering problem. Text document clustering is a new trend in text mining; in this process, text documents are separated into several coherent clusters according to carefully selected informative features by using a proper evaluation function, which usually depends on term frequency. Informative features in each document are selected using feature selection methods. Genetic algorithm (GA), harmony search (HS) algorithm, and particle swarm optimization (PSO) algorithm are the most successful feature selection methods established using a novel weighting scheme, namely, length feature weight (LFW), which depends on term frequency and the appearance of features in other documents. A new dynamic dimension reduction (DDR) method is also provided to reduce the number of features used in clustering and thus improve the performance of the algorithms. Finally, k-means, a popular clustering method, is used to cluster the set of text documents based on the terms (or features) obtained by dynamic reduction. Seven benchmark text datasets of different sizes and complexities are evaluated. Analysis with k-means shows that particle swarm optimization with length feature weight and dynamic reduction produces the best outcomes for almost all datasets tested. This paper provides new alternatives for the text mining community to cluster text documents by using cohesive and informative features.

19.
Term weighting is a strategy that assigns weights to terms to improve the performance of sentiment analysis and other text mining tasks. In this paper, we propose a supervised term weighting scheme based on two basic factors, the importance of a term in a document (ITD) and the importance of a term for expressing sentiment (ITS), to improve the performance of analysis. For ITD, we explore three definitions based on term frequency. Then, seven statistical functions are employed to learn the ITS of each term from training documents with category labels. Compared with previous unsupervised term weighting schemes that originated in information retrieval, our scheme can make full use of the available labeling information to assign appropriate weights to terms. We have experimentally evaluated the proposed method against the state-of-the-art method. The experimental results show that our method outperforms that method and produces the best accuracy on two of three data sets.
