首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 296 毫秒
1.
Most of the research on text categorization has focused on classifying text documents into a set of categories with no structural relationships among them (flat classification). However, in many information repositories documents are organized in a hierarchy of categories to support a thematic search by browsing topics of interests. The consideration of the hierarchical relationship among categories opens several additional issues in the development of methods for automated document classification. Questions concern the representation of documents, the learning process, the classification process and the evaluation criteria of experimental results. They are systematically investigated in this paper, whose main contribution is a general hierarchical text categorization framework where the hierarchy of categories is involved in all phases of automated document classification, namely feature selection, learning and classification of a new document. An automated threshold determination method for classification scores is embedded in the proposed framework. It can be applied to any classifier that returns a degree of membership of a document to a category. In this work three learning methods are considered for the construction of document classifiers, namely centroid-based, naïve Bayes and SVM. The proposed framework has been implemented in the system WebClassIII and has been tested on three datasets (Yahoo, DMOZ, RCV1) which present a variety of situations in terms of hierarchical structure. Experimental results are reported and several conclusions are drawn on the comparison of the flat vs. the hierarchical approach as well as on the comparison of different hierarchical classifiers. The paper concludes with a review of related work and a discussion of previous findings vs. our findings.  相似文献   

2.
特征抽取是中文文本分类的重点和难点,文中比较了不同特征单元对分类性能的影响,将字特征与词特征相结合以期更好地表现文本特征。并在构建的实验系统中比较了不同特征单元的分类准确性,发现采用混合特征来进行分类,能得到较好的分类效果。  相似文献   

3.
在标签均衡分布且标注样本足够多的数据集上,监督式分类算法通常可以取得比较好的分类效果。然而,在实际应用中样本的标签分布通常是不均衡的,分类算法的分类性能就变得比较差。为此,结合SLDA(Supervised LDA)有监督主题模型,提出一种不均衡文本分类新算法ITC-SLDA(Imbalanced Text Categorization based on Supervised LDA)。基于SLDA主题模型,建立主题与稀少类别之间的精确映射,以提高少数类的分类精度。利用SLDA模型对未标注样本进行标注,提出一种新的未标注样本的置信度计算方法,以及类别约束的采样策略,旨在有效采样未标注样本,最终降低不均衡文本的倾斜度,提升不均衡文本的分类性能。实验结果表明,所提方法能明显提高不均衡文本分类任务中的Macro-F1和G-mean值。  相似文献   

4.
As the number of documents has been rapidly increasing in recent time, automatic text categorization is becoming a more important and fundamental task in information retrieval and text mining. Accuracy and interpretability are two important aspects of a text classifier. While the accuracy of a classifier measures the ability to correctly classify unseen data, interpretability is the ability of the classifier to be understood by humans and provide reasons why each data instance is assigned to a label. This paper proposes an interpretable classification method by exploiting the Dirichlet process mixture model of von Mises–Fisher distributions for directional data. By using the labeled information of the training data explicitly and determining automatically the number of topics for each class, the learned topics are coherent, relevant and discriminative. They help interpret as well as distinguish classes. Our experimental results showed the advantages of our approach in terms of separability, interpretability and effectiveness in classification task of datasets with high dimension and complex distribution. Our method is highly competitive with state-of-the-art approaches.  相似文献   

5.
针对文本自动分类问题,提出了一种基于模糊向量空间模型和径向基函数网络的分类方法.网络由输入层、隐层和输出层组成.输入层完成分类样本的输入,隐层提取输入样本所隐含的模式特征,将分类结果在输出层表现出来.该方法在特征提取时充分考虑了特征项在文档中的位置信息,构造出模糊特征向量,使自动分类更接近手工分类方法.以中国期刊网全文数据库部分文档数据为例验证了该方法的有效性.  相似文献   

6.
基于特征投票机制设计一种线性文本分类方法,运用信任机制理论分析文档类别对特征的信任关系,给出具体特征信任度的模型,并在Newsgroup、复旦中文分类语料、Reuters-21578 3个广泛使用且具有不同特性的语料集上与传统方法进行比较。实验结果表明,该方法分类性能优于传统方法且稳定、高效,适用于大规模文本分类任务。  相似文献   

7.
多代表点的子空间分类算法   总被引:1,自引:0,他引:1       下载免费PDF全文
多代表点近邻分类克服了传统近邻分类算法的缺点,使用以代表点为中心的模型簇构造分类模型并自动确定近邻数目.此类算法在不同类别的样本存在大量重叠时将导致模型簇数量增大,造成预测精度下降.提出了一种多代表点的子空间分类算法,将不同类别的训练样本投影到多个不同的子空间,使用子空间模型簇构造分类模型,有效分隔了不同类别样本在全空...  相似文献   

8.
文本分类为一个文档自动分配一组预定义的类别或主题。文本分类中,文档的表示对学习机的学习性能有很大的影响。以实现哈萨克语文本分类为目的,根据哈萨克语语法规则设计实现哈萨克语文本的词干提取,完成哈萨克语文本的预处理。提出基于最近支持向量机的样本距离公式,避免k参数的选定,以SVM与KNN分类算法的特殊组合算法(SV-NN)实现了哈萨克语文本的分类。结合自己构建的哈萨克语文本语料库的语料进行文本分类仿真实验,数值实验展示了提出算法的有效性并证实了理论结果。  相似文献   

9.
特征选择是维吾尔语文本分类的关键技术,对分类结果将产生直接的影响。为了提高传统信息增益在维吾尔文特征选择中的效果,在深度分析维吾尔文语种特点的基础上,提出了一种新的信息增益特征选择方法。该方法结合类词频和特征分布系数以及倒逆文档频率,对传统信息增益进行修正;引入一个备选特征分布系数来平衡类间选取的特征个数;在维吾尔文数据集上实验验证。实验结果表明,改进的算法对维吾尔文分类效果有明显的提高。  相似文献   

10.
A Fuzzy Approach to Classification of Text Documents   总被引:1,自引:0,他引:1       下载免费PDF全文
This paper discusses the classification problems of text documents. Based on the concept of the proximity degree, the set of words is partitioned into some equivalence classes.Particularly, the concepts of the semantic field and association degree are given in this paper.Based on the above concepts, this paper presents a fuzzy classification approach for document categorization. Furthermore, applying the concept of the entropy of information, the approaches to select key words from the set of words covering the classification of documents and to construct the hierarchical structure of key words are obtained.  相似文献   

11.
文本的表示与文本的特征提取是文本分类需要解决的核心问题,基于此,提出了基于改进的连续词袋模型(CBOW)与ABiGRU的文本分类模型。该分类模型把改进的CBOW模型所训练的词向量作为词嵌入层,然后经过卷积神经网络的卷积层和池化层,以及结合了注意力(Attention)机制的双向门限循环单元(BiGRU)神经网络充分提取了文本的特征。将文本特征向量输入到softmax分类器进行分类。在三个语料集中进行的文本分类实验结果表明,相较于其他文本分类算法,提出的方法有更优越的性能。  相似文献   

12.
In the present article we introduce and validate an approach for single-label multi-class document categorization based on text content features. The introduced approach uses the statistical property of Principal Component Analysis, which minimizes the reconstruction error of the training documents used to compute a low-rank category transformation matrix. Such matrix transforms the original set of training documents from a given category to a new low-rank space and then optimally reconstructs them to the original space with a minimum reconstruction error. The proposed method, called Minimizer of the Reconstruction Error (mRE) classifier, uses this property, and extends and applies it to new unseen test documents. Several experiments on four multi-class datasets for text categorization are conducted in order to test the stable and generally better performance of the proposed approach in comparison with other popular classification methods.  相似文献   

13.
基于深度信念网络的文本分类算法   总被引:2,自引:0,他引:2  
随着网络的迅猛发展,文本分类成为处理和组织大量文档数据的关键技术.目前已经有许多不同类型的神经网络应用于文本分类,并且取得良好的效果.但是,大部分模型仅采用文档的少量特征作为输入,没有考虑到足够的信息量;而当考虑到足够的特征时,又会发生维数灾难,导致模型难以训练或者训练时间大幅增加.利用深度信念网络从文本中抽取特征,并利用softmax回归分类器对抽取后的特征分类.深度信念网络不仅具有强大的学习能力,同时还能从高维的原始特征中抽取低维度高度可区分的低维特征,因此利用深度信念网络来对文本分类,不仅能够考虑到文档的足够的信息量,而且能够快速的训练.并且实验结果也表明利用深度信念网络实现文本分类的性能很好.  相似文献   

14.
提出了一种基于支持向量且能识别噪音特征的文本特征评估方法,以及一种具有自我反馈学习能力的文本分类系统。该系统能够根据分类结果识别样本噪音句子、调整分类器参数、维护和优化文本库、提高分类器性能并使分类器能适应实际情况的不断变化。实验结果表明该方法能有效改善自动文本分类系统的性能。  相似文献   

15.
Harun Uğuz 《Knowledge》2011,24(7):1024-1032
Text categorization is widely used when organizing documents in a digital form. Due to the increasing number of documents in digital form, automated text categorization has become more promising in the last ten years. A major problem of text categorization is its large number of features. Most of those are irrelevant noise that can mislead the classifier. Therefore, feature selection is often used in text categorization to reduce the dimensionality of the feature space and to improve performance. In this study, two-stage feature selection and feature extraction is used to improve the performance of text categorization. In the first stage, each term within the document is ranked depending on their importance for classification using the information gain (IG) method. In the second stage, genetic algorithm (GA) and principal component analysis (PCA) feature selection and feature extraction methods are applied separately to the terms which are ranked in decreasing order of importance, and a dimension reduction is carried out. Thereby, during text categorization, terms of less importance are ignored, and feature selection and extraction methods are applied to the terms of highest importance; thus, the computational time and complexity of categorization is reduced. To evaluate the effectiveness of dimension reduction methods on our purposed model, experiments are conducted using the k-nearest neighbour (KNN) and C4.5 decision tree algorithm on Reuters-21,578 and Classic3 datasets collection for text categorization. The experimental results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and F-measure.  相似文献   

16.
由于短文本长度较短,在分类时会面临数据稀疏和语义模糊等问题。提出新型图卷积网络BTM_GCN,该网络利用双项主题模型(Biterm Topic Model,BTM)在短文本数据集上训练出固定数量的文档级潜在主题,并作为一种节点嵌入到文本异构图中,再与异构图中的文档节点进行连接,最后利用图卷积网络来捕获文档、词与主题节点之间的高阶邻域信息,从而丰富文档节点的语义信息,缓解短文本语义模糊的问题。在三个英文短文本数据集上的实验结果表明,该方法相比基准模型具有较优的分类效果。  相似文献   

17.
石松  陈云 《计算机工程》2014,(2):171-174
投影寻踪可有效解决文本分类中的维数灾难问题,而投影方向优化是投影寻踪需要解决的关键问题。传统的投影寻踪方法将投影指标优化看作单目标优化问题,会使解的质量受到影响。为此,提出一种基于多目标优化的投影寻踪方法。将类别之间的距离和类别内数据的聚类紧密程度作为2个优化目标,并将投影扩展到多维,利用混沌粒子群优化算法寻找最优的投影方向。在常用文本数据集上进行实验,确定最优投影指标及维度,并比较不同分类模型的分类结果,结果表明,使用该方法能有效提高文本分类性能。  相似文献   

18.
Automated text categorization has witnessed a booming interest with the exponential growth of information and the ever-increasing needs for organizations. The underlying hierarchical structure identifies the relationships of dependence between different categories and provides valuable sources of information for categorization. Although considerable research has been conducted in the field of hierarchical document categorization, little has been done on automatic generation of topic hierarchies. In this paper, we propose the method of using linear discriminant projection to generate more meaningful intermediate levels of hierarchies in large flat sets of classes. The linear discriminant projection approach first transforms all documents onto a low-dimensional space and then clusters the categories into hier- archies accordingly. The paper also investigates the effect of using generated hierarchical structure for text classification. Our experiments show that generated hierarchies improve classification performance in most cases.  相似文献   

19.
用Boosting方法组合增强Stumps进行文本分类   总被引:11,自引:0,他引:11  
为提高文本分类的精度,Schapire和Singer尝试了一个用Boosting来组合仅有一个划分的简单决策树(Stumps)的方法.其基学习器的划分是由某个特定词项是否在待分类文档中出现决定的.这样的基学习器明显太弱,造成最后组合成的Boosting分类器精度不够理想,而且需要的迭代次数很大,因而效率很低.针对这个问题,提出由文档中所有词项来决定基学习器划分以增强基学习器分类能力的方法.它把以VSM表示的文档与类代表向量之间的相似度和某特定阈值的大小关系作为基学习器划分的标准.同时,为提高算法的收敛速度,在类代表向量的计算过程中动态引入Boosting分配给各学习样本的权重.实验结果表明,这种方法提高了用Boosting组合Stump分类器进行文本分类的性能(精度和效率),而且问题规模越大,效果越明显.  相似文献   

20.
Real-life datasets are often imbalanced, that is, there are significantly more training samples available for some classes than for others, and consequently the conventional aim of reducing overall classification accuracy is not appropriate when dealing with such problems. Various approaches have been introduced in the literature to deal with imbalanced datasets, and are typically based on oversampling, undersampling or cost-sensitive classification. In this paper, we introduce an effective ensemble of cost-sensitive decision trees for imbalanced classification. Base classifiers are constructed according to a given cost matrix, but are trained on random feature subspaces to ensure sufficient diversity of the ensemble members. We employ an evolutionary algorithm for simultaneous classifier selection and assignment of committee member weights for the fusion process. Our proposed algorithm is evaluated on a variety of benchmark datasets, and is confirmed to lead to improved recognition of the minority class, to be capable of outperforming other state-of-the-art algorithms, and hence to represent a useful and effective approach for dealing with imbalanced datasets.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号