Similar Literature
 Found 20 similar documents; search took 31 ms
1.
Citations in documents contain important information about the sources that authors cite and about the importance and impact of those sources. Automatic identification of citations in documents is therefore an important task. For various reasons, citations in rabbinic literature are more difficult to identify and extract than citations in scientific papers written in English. The aim of this research is to automatically identify undated citations in a unique data set: rabbinic documents written in Hebrew-Aramaic. We formulate four feature sets: orthographic, quantitative, stopword-based, and n-gram-based. We ran experiments on all combinations of these feature sets using six common machine learning methods and InfoGain. A combination of all four feature sets using logistic regression achieves an accuracy of 91.98%, an improvement of 16.53% over a baseline result.

2.
The brevity of mobile phone messages makes it difficult to distinguish lexical patterns that identify spam. This paper proposes a novel approach to spam classification of extremely short messages that uses not only lexical features, which reflect the content of a message, but also new stylistic features, which indicate the manner in which the message is written. Experiments on two mobile phone message collections in two different languages show that the approach significantly outperforms previous content-based approaches, regardless of language.

3.
The rapid expansion of multimedia digital collections brings to the fore the need for classifying not only text documents but their embedded non-textual parts as well. We propose a model for basing classification of multimedia on broad, non-topical features, and show how information on targeted nearby pieces of text can be used to effectively classify photographs on a first such feature, distinguishing between indoor and outdoor images. We examine several variations to a TF*IDF-based approach for this task, empirically analyze their effects, and evaluate our system on a large collection of images from current news newsgroups. In addition, we investigate alternative classification and evaluation methods, and the effects that secondary features have on indoor/outdoor classification. Using density estimation over the raw TF*IDF values, we obtain a classification accuracy of 82%, a number that outperforms baseline estimates and earlier, image-based approaches, at least in the domain of news articles, and that nears the accuracy of humans who perform the same task with access to comparable information. Published online: 22 September 2000

4.
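To address source printer identification for printed documents effectively, this paper proposes a method based on the selection of statistical texture features. Taking into account both the spatial-domain and the time-frequency-domain characteristics of printed character images, the method combines GLCM and DWT statistical texture features, applies the ReliefF algorithm for a first-pass selection over the combined features, and uses the SVM-RFE algorithm for a second selection stage. Experimental results show classification accuracies of 95.20% on an English sample set of identical characters with repetition and 75.00% on a Chinese sample set of distinct characters without repetition; feature combination and feature selection both help improve the discriminative performance of source printer identification.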

5.
6.
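The rapid development of information technology has led to the accumulation of large volumes of text data, a large share of which is short text. Text classification is important for automatically extracting knowledge from these massive short-text collections. However, because keywords occur only a few times in a short text and labeled training samples are usually scarce, general-purpose text mining algorithms rarely reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient for massive data. This paper proposes a novel short-text classification algorithm. The algorithm is based on a semantic feature graph of the text and classifies with a kNN-like method. Experiments show that it outperforms other algorithms in both accuracy and performance when classifying massive short-text collections.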

7.
The problem of classifying clouds seen on meteorological satellite images into different types is one which requires the use of textural as well as spectral information. Since multi-spectral features are of prime importance, textural features must be considered as augmenting, rather than replacing, spectral measures. Several textural features are studied to determine their discriminating power across a number of cloud classes including those which have previously been found difficult to separate. Although several features in the frequency domain are tested they are found to be less useful than those in the spatial domain with only one exception. The specific features recommended for use in classification depend on the type of classification to be undertaken. Specifically, different features should be used for a multi-dimensional feature space analysis than for a binary-tree rule-based classification.

8.
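A classification method for massive short texts based on frequent term set clustering   Cited by: 1 (self-citations: 0; citations by others: 1)
王永恒, 贾焰, 杨树强. 《计算机工程与设计》 (Computer Engineering and Design), 2007, 28(8): 1744-1746, 1780
The rapid development of information technology has led to the accumulation of large volumes of text data, a large share of which is short text. Text classification is important for automatically extracting knowledge from these massive short-text collections. However, for short texts in which keywords occur only a few times, general-purpose text mining algorithms rarely reach acceptable accuracy. Some semantics-based classification methods achieve better accuracy but are too inefficient for massive data. To address this problem, the paper proposes a novel short-text classification algorithm based on frequent term set clustering. The algorithm uses frequent term set clustering to compress the data and uses semantic information for classification. Experiments show that it outperforms other algorithms in both accuracy and performance when classifying massive short-text collections.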

9.
Many language identification (LID) systems are based on language models using techniques that consider the fluctuation of speech over time. Considering these fluctuations necessitates longer recording intervals to obtain reasonable accuracy. Our research extracts features from short recording intervals to enable successful classification of spoken language. The feature extraction process is based on frames of 20 ms, whereas most previous LID systems presented results based on much longer frames (3 s or longer). We defined and implemented 200 features divided into four feature sets: cepstrum features, RASTA features, spectrum features, and waveform features. We applied eight machine learning (ML) methods to the features extracted from a corpus containing speech files in 10 languages from the Oregon Graduate Institute (OGI) telephone speech database and compared their performance in an extensive experimental evaluation. The best optimized classification results were achieved by random forest (RF): from 76.29% on 10 languages to 89.18% on 2 languages. These results are better than or comparable to the state-of-the-art results for the OGI database. Another set of experiments addressed gender classification for 2 to 10 languages. The accuracy and the F-measure values of the RF method in all the language experiments were greater than or equal to 90.05%.
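The 20 ms framing mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the sampling rate or whether frames overlap, so the 8 kHz rate used in the example and the non-overlapping layout are assumptions, and `split_into_frames` is a hypothetical helper name rather than anything taken from the paper.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Split a sequence of audio samples into consecutive fixed-length frames.

    Assumption: frames do not overlap, and a trailing partial frame is dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# One second of placeholder samples at an assumed 8 kHz rate.
one_second = list(range(8000))
frames = split_into_frames(one_second, sample_rate_hz=8000)
```

Under these assumptions each frame holds 160 samples, so one second of audio yields 50 frames, and per-frame features (cepstrum, spectrum, and so on) would be computed on each slice.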

10.
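KNN classifier combination based on multiple correspondence analysis   Cited by: 1 (self-citations: 0; citations by others: 1)
This paper proposes a KNN classifier combination method based on multiple correspondence analysis (MCA KNN). Taking handwriting recognition as an example, KNN classifiers are applied to different feature sets derived from the same sample set, and their outputs are then combined through multiple correspondence analysis to produce the final classification. Experimental results show that this combination method significantly reduces the classification error rate.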

11.
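Similarity-based text classification is a popular approach to text processing. The similarity measure based on feature membership degree uses the membership relation between features and documents to measure document similarity and thereby classify text. Based on this relation, the method partitions features into full-membership, partial-membership, and non-membership term sets, and defines a membership function over the three sets. A full-membership term set belongs to both documents, and its membership degree decreases as the weight difference grows; a partial-membership term set belongs to only one of the two documents, and its membership degree is a constant; a non-membership term set belongs to neither document, and its membership degree is zero. When similarity is measured, the partial-membership relation ranks above the full-membership relation. Because documents of the same class have similar term sets while documents of different classes differ markedly, measuring similarity by feature-document membership degree clearly delineates the membership of term sets in classes and improves classification accuracy. Finally, classification effectiveness is validated on the 20 Newsgroups and Reuters-21578 data sets; the results show that the method outperforms currently popular similarity measures.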

12.
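In review texts that cover multiple objects and multiple attributes, identifying omitted opinion targets and opinion attributes plays an important role in opinion mining. To handle omitted opinion targets and attributes in sentiment-bearing opinion sentences, this paper proposes an effective method for identifying omitted items. It first constructs a rule set for omitted-item identification, which is used to obtain a candidate set of omitted items; it then treats identification as a binary classification problem, uses lexical and dependency-syntax features, trains a classifier with the C4.5 decision tree algorithm, and judges the candidates on a test set. Experimental results show that the F-score obtained with the dependency-syntax feature set is about 2% higher than with the lexical feature set. Compared with either feature type alone, fusing the two improves precision and F-score by about 10% and 5%, respectively, indicating that the fusion of lexical and dependency-syntax features benefits omitted-item identification.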

13.
14.
An off-line handwriting recognition (OFHR) system is a computerized system that is capable of intelligently converting human handwritten data extracted from scanned paper documents into an equivalent text format. This paper studies a proposed OFHR system for Malaysian bank cheques written in the Malay language. The proposed system comprises three components, namely a character recognition system (CRS), a hybrid decision system, and a lexical word classification system. Two types of feature extraction techniques are used in the system, namely statistical and geometrical. Experiments show that the statistical features are reliable and accessible and yield more accurate results. The CRS in this system was implemented using two individual classifiers, namely an adaptive multilayer feed-forward back-propagation neural network and a support vector machine. The results of this study are very promising and could generalize to the entire Malay lexical dictionary in future work toward scaled-up applications.

15.
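Research on text feature selection and weighting algorithms based on class information   Cited by: 3 (self-citations: 1; citations by others: 2)
In automatic text classification, feature selection and weighting aim to reduce the dimensionality of the text feature space, remove noise, and improve classification accuracy. Features chosen by traditional selection schemes tend to favor the large classes in document collections with an uneven class distribution, while the commonly used TF·IDF weighting scheme considers only the relation between a feature and a document and ignores the relation between a feature and a class. To address these problems, this paper proposes a feature selection and weighting method based on class information and runs comparative experiments on two different corpora. The results show that, on document collections with an uneven class distribution, the method improves classification accuracy over traditional methods and also achieves greater dimensionality reduction.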
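The limitation this abstract attributes to TF·IDF can be seen in a minimal sketch of one textbook variant of the weighting, tf × log(N / df). The function name `tf_idf` and the toy documents are illustrative assumptions; the paper's own class-aware weighting is not specified in the abstract and is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term count / document length) * log(N / document frequency).
    Note that class labels never enter the computation.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (invented): "a" appears in every document, so its weight is zero.
toy_docs = [["a", "b"], ["a", "c"]]
toy_weights = tf_idf(toy_docs)
```

Because the inputs are only the documents themselves, a term concentrated in a single small class receives the same weight as a term spread evenly across classes with the same document frequency, which is the gap the class-information approach targets.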

16.
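With multilingual text data growing ever richer, and in order to classify texts in different languages under a single category scheme and fully exploit the value of multilingual text, this paper proposes BiLSTM-CNN, a multilingual text classification model that combines bidirectional long short-term memory units with a convolutional neural network. For each language, a bidirectional LSTM network extracts text features and a convolutional neural network then refines them, yielding a deeper text representation per language; finally, the per-language representations are concatenated and fed to a softmax function to predict the category. Experiments on a Chinese-English-Korean parallel data set of scientific and technical literature show that the method improves classification accuracy by 4% over the baseline methods, classifies text in any of the languages correctly, and extends well to additional languages.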

17.
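This paper presents a fuzzy technique for comparing XML documents, a form of semi-structured data stream, and builds classification on top of it, covering both structure-based and content-based document classification. The method relies on a flat encoding of XML document fragments: each XML document is represented as a fuzzy bag, and a comparison function computes the structural similarity between documents. Once XML documents have been classified by structure, their content can be taken into account to obtain a finer classification.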

18.
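Edit distance is a distance measure derived from the number of edit operations needed to transform one string into another. It can classify languages automatically, has attracted considerable attention in the West in recent years, and has proved effective for measuring distances between languages or dialects. This study applies the edit distance algorithm to produce a quantitative classification of the Kam-Tai languages and to describe their degree of genetic relatedness. The results show that the edit-distance classification is broadly consistent with the classification from historical linguistics, offering a new approach for quantitative methods. Edit distance can be applied to the study of East Asian languages.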
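The edit distance described in this abstract can be sketched as follows. This is the standard Levenshtein formulation with unit costs; the length normalization and the averaging over aligned word pairs are common choices assumed here, not details confirmed by the abstract, and the helper names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def language_distance(words_a, words_b) -> float:
    """Mean normalized distance over word pairs aligned by meaning."""
    pairs = list(zip(words_a, words_b))
    if not pairs:
        raise ValueError("need at least one aligned word pair")
    return sum(normalized_distance(x, y) for x, y in pairs) / len(pairs)
```

For example, `edit_distance("kitten", "sitting")` returns 3. A pairwise matrix of `language_distance` values over a shared concept list could then be passed to a clustering procedure to obtain a language grouping.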

19.
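This paper proposes a scene image classification algorithm that combines Gabor-LBP frequency-domain texture features with semantic features from a bag-of-words model. Frequency-domain information obtained from the Gabor transform, together with the corresponding LBP features, is adaptively fused with semantic features extracted by the bag-of-visual-words (BOW) model to perform classification. To validate the algorithm, comparative tests were run on two standard image test sets; the results show that it has a clear advantage in improving image texture representation and, in particular, is robust to changes in illumination, rotation, and scale.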

20.
This article reports on our experiments and results on the effectiveness of different feature sets, and of information fusion from some combinations of them, in classifying free text documents into a given number of categories. We use different feature sets and integrate neural network learning into the method. The feature sets are based on the “latent semantics” of a reference library (a collection of documents adequately representing the desired concepts). We found that a larger reference library is not necessarily better. Information fusion almost always gives better results than the individual constituent feature sets, with certain combinations doing better than the others.
