Similar Documents
20 similar documents found (search time: 125 ms)
1.
Semi-Supervised Sentiment Classification of Web Texts Based on Topic Models (cited 1 time: 0 self-citations, 1 by others)
To address the class imbalance, lack of labels, and informality found in sentiment classification of online review texts, a semi-supervised learning model with topic-based threshold adjustment is proposed. Topic features are extracted from the unstructured text, a classifier is trained on a small set of sentiment-labeled texts, and the decision threshold is adjusted to optimize the evaluation metric, so that the sentiment orientation of user reviews can be identified. Simulation studies show that the threshold-adjusted semi-supervised model adapts well to imbalanced and unlabeled data. In an empirical study, a sentiment classifier built on hotel review texts shows that the model can effectively predict the sentiment polarity of minority-class review samples, confirming the applicability and feasibility of the topic-model-based, threshold-adjusted semi-supervised sentiment classification model for online review texts in practical problems.
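The threshold-adjustment idea in this abstract — tuning the decision cutoff on a small labeled set so that the minority sentiment class is recognized — can be sketched as follows. This is a minimal illustration, not the authors' implementation; choosing the cutoff by F1 and the function name `best_threshold` are assumptions of the sketch.

```python
def best_threshold(scores, labels, grid=None):
    """Pick the decision threshold that maximizes F1 for the minority (positive) class.

    scores: classifier scores for the positive class; labels: 0/1 ground truth.
    """
    if grid is None:
        grid = sorted(set(scores))
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

On imbalanced data this moves the cutoff away from the default 0.5 toward whatever value best recovers the rare class.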

2.
Semantic feature extraction from police case texts means extracting features such as the modus operandi from case descriptions; in essence, it is a special kind of text classification problem. A framework for text semantic feature extraction based on convolutional neural networks (CNN) is constructed: a CNN text classification model is built; for the multi-label feature extraction problem, the problem transformation approach is combined with CNN classification; and the difficulties caused by imbalanced data are discussed, with the loss function of the CNN model improved accordingly. Empirical results show that the CNN model outperforms traditional classifiers such as support vector machines on text classification; the binary relevance variant of the problem transformation approach combined with the CNN model achieves high accuracy for multi-label semantic feature extraction; and the improved CNN model is better suited to imbalanced data, with a significant increase in macro-averaged F1.
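The loss-function improvement for imbalanced data described above is commonly realized as class-weighted cross-entropy, where errors on rare classes cost more. A minimal sketch (the paper does not specify its exact weighting scheme, so the per-class weights here are an assumption):

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Class-weighted cross-entropy: up-weights errors on minority classes.

    probs: predicted probability distributions, one per sample;
    labels: true class indices; class_weights: weight per class.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -class_weights[y] * math.log(p[y])
    return total / len(labels)
```

With uniform weights this is ordinary cross-entropy; raising the weight of the minority class makes the same misclassification more expensive, pushing the model to fit rare labels.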

3.
Keyword Extraction from Text Based on Weighted Complex Networks (cited 2 times: 0 self-citations, 2 by others)
A new algorithm for extracting text keywords based on weighted complex networks is presented. First, a weighted complex network model of the text is built from the relations between its feature words; next, a composite feature value is computed for each node from its weighted clustering coefficient and its betweenness; finally, keywords are extracted according to this composite value. Experimental results show that the extracted keywords reflect the text's topic well, and extraction accuracy improves markedly over existing algorithms.
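The node-scoring step can be sketched with Barrat's weighted clustering coefficient. Note that the paper combines the clustering coefficient with betweenness centrality; to keep this sketch short, node strength stands in for betweenness, which is a simplification, and `keyword_score` with its `alpha` mixing weight is hypothetical:

```python
def strength(adj, i):
    """Node strength: sum of the weights of incident edges."""
    return sum(adj[i].values())

def weighted_clustering(adj, i):
    """Barrat weighted clustering coefficient of node i.

    adj: dict node -> {neighbor: weight} for an undirected weighted graph.
    """
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for a in range(k):
        for b in range(a + 1, k):
            j, h = nbrs[a], nbrs[b]
            if h in adj[j]:  # the triangle i-j-h is closed
                total += adj[i][j] + adj[i][h]
    return total / (strength(adj, i) * (k - 1))

def keyword_score(adj, i, alpha=0.5):
    """Composite node score; strength replaces betweenness in this sketch."""
    return alpha * weighted_clustering(adj, i) + (1 - alpha) * strength(adj, i)
```

Words whose nodes sit in densely connected, heavily weighted neighbourhoods score highest and are proposed as keywords.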

4.
The effect of different word segmentation schemes on text classification is studied, using two classical text representations (LDA and LSA) and two classifiers (support vector machines and logistic regression), for four experimental configurations in total. The results show that, compared with conventional segmentation, the second, search-engine-style segmentation scheme — which splits words and adds compound terms — is more effective for classification. Specifically, with LDA representations, scheme 2 reaches a top accuracy of 95.38% versus 93.7% for scheme 1; with LSA representations, scheme 2 reaches 96.44% versus 95.2% for scheme 1.

5.
Spam reviews reduce the reference value of product review information. This paper builds a recognition model to filter spam reviews out of review texts while retaining genuine ones. First, the characteristics of product reviews are analyzed and 14 features are extracted across four modules: data collection, text preprocessing, mutual-information testing, and text representation. Then, exploiting their high complementarity, a combined classifier based on the KNN and Bayes algorithms is built. Finally, cross-validation on iPhone 6 Plus product reviews yields an accuracy of 75.3%, a recall of 82.1%, and an F1 value of 77.5%.
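A combined KNN-plus-naive-Bayes classifier of the kind described can be sketched on binary feature vectors. The paper does not state its exact combination rule, so the fallback-to-Bayes rule on disagreement below is an assumption of this sketch:

```python
import math

def knn_predict(train_x, train_y, x):
    """1-nearest-neighbour on binary feature vectors (Hamming distance)."""
    dists = [(sum(a != b for a, b in zip(tx, x)), y)
             for tx, y in zip(train_x, train_y)]
    return min(dists)[1]

def nb_predict(train_x, train_y, x):
    """Bernoulli naive Bayes with Laplace smoothing."""
    best, best_lp = None, -math.inf
    for c in set(train_y):
        rows = [tx for tx, y in zip(train_x, train_y) if y == c]
        lp = math.log(len(rows) / len(train_y))  # log prior
        for j, xj in enumerate(x):
            p1 = (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
            lp += math.log(p1 if xj else 1 - p1)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def combined_predict(train_x, train_y, x):
    """Combiner: agree -> that label; disagree -> trust naive Bayes (assumed rule)."""
    a, b = knn_predict(train_x, train_y, x), nb_predict(train_x, train_y, x)
    return a if a == b else b
```

The complementarity argument is that KNN captures local structure while naive Bayes captures global per-feature statistics, so their combination can outperform either alone.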

6.
Identifying users' purchase intent is one of the key ways to raise the purchase rate (PR) in e-commerce. To deal with ambiguous purchase intent, a new model is proposed: trained Word2Vec (WV) word vectors are fed into a convolutional neural network (CNN), and a deep structured semantic model (DSSM) further extracts text features. An empirical analysis is carried out in the Keras framework on real search data from Home Depot, a US building-materials e-commerce site. On a five-class problem, the new model achieves an F1-score of 80.6% on the test set. The model uses Word2Vec and CNN to extract text features and applies DSSM to further extract high-dimensional feature representations of user queries and product description documents, maximizing the semantic similarity between a user query and the correct product description while avoiding subjective interference during feature extraction, thereby improving purchase-intent recognition.

7.
For early tumor diagnosis, a feature extraction method based on the lifting wavelet transform is proposed to analyze and discriminate tumor samples. The lifting wavelet transform is applied to gene expression microarray data from 190 liver cancer cases (including controls) and 107 lung cancer cases (including controls); the low-frequency content of the signal is extracted, and a support vector machine is trained on it to build a classifier separating cancer from non-cancer samples. Experiments show that the feature genes extracted by the lifting wavelet transform yield high classification rates, with both linear and radial-basis kernels performing well in the SVM. The model was further tested on 20 randomly selected microarray samples with good results, so the proposed method has practical value for tumor diagnosis.
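One level of the lifting wavelet transform follows the split–predict–update pattern; the sketch below shows the Haar case, whose approximation channel carries the low-frequency content used as features above. The paper does not state which wavelet it uses, so Haar is an assumption:

```python
def haar_lifting(signal):
    """One level of the Haar lifting scheme: split, predict, update.

    Returns (approximation, detail); the approximation is the
    low-frequency part, the detail the high-frequency part.
    """
    s = list(signal[0::2])      # split: even samples
    d = list(signal[1::2])      # split: odd samples
    for i in range(len(d)):
        d[i] -= s[i]            # predict: detail = odd - even
    for i in range(len(d)):
        s[i] += d[i] / 2.0      # update: approximation preserves local means
    return s, d
```

Applying this recursively to the approximation channel yields the multi-level low-frequency features fed to the SVM.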

8.
Traditional analysis of text data often represents texts with term-frequency or TF-IDF statistics. Such methods capture only part of the information in a text and ignore its intrinsic semantics. This paper studies a probabilistic language model of Chinese word cohesion, whose basic idea is to model the order in which words appear in a text; the model supports quantitative semantic analysis in short-text mining. Two problems are addressed: first, how to map Chinese words to numeric vectors so that near-synonyms remain close in the vector space; second, how to construct a vector space that preserves the semantic and structural information of Chinese text. Finally, using real message-board texts from a city's housing administration department, the probabilistic language model is fitted with two algorithms, a BP neural network and an RNN. Comparison with traditional text processing methods shows that the proposed model has advantages for short-text semantic mining.
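The word-order idea behind the probabilistic language model can be illustrated with its simplest instance, a bigram model with add-one smoothing; the paper's actual model, solved with BP and RNN networks, is more elaborate, so this is only a sketch of the underlying principle:

```python
from collections import defaultdict

class BigramModel:
    """Word-order probability model: P(w_i | w_{i-1}) with add-one smoothing."""

    def __init__(self, sentences):
        self.bigrams = defaultdict(int)
        self.unigrams = defaultdict(int)
        self.vocab = set()
        for words in sentences:
            padded = ["<s>"] + list(words)  # sentence-start marker
            self.vocab.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1

    def prob(self, prev, word):
        """Smoothed conditional probability of `word` following `prev`."""
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))
```

Scoring a text by the product of these conditional probabilities quantifies how well its word order matches the training corpus, which is exactly the cohesion signal the abstract describes.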

9.
The spam text recognition problem is studied. An SVM classification model built on feature vectors of iPhone review texts is used to detect spam texts and compared with a BP neural network discriminant model: accuracy is 71% on the first 400 training samples and 70.12% on the remaining 196 test samples. The main factors affecting spam opinion detection are therefore: (1) the extraction of feature terms from review texts and the construction of the feature-space vectors; and (2) the choice of discriminant method, among which SVM performs best.

10.
Classification of Gene Expression Data from Two Populations Based on Bayesian Statistical Methods (cited 1 time: 0 self-citations, 1 by others)
In disease diagnosis, accurate classification is crucial to improving diagnostic accuracy and cure rates. DNA microarray technology provides gene-function information closely related to disease classification and diagnosis at the microscopic level, but the expression data it produces have many variables and few samples, which makes classification highly unstable. We therefore first screen for genes whose expression patterns change significantly, forming a feature gene set that reduces the number of variables, and then build a classifier on this set. A likelihood ratio test is used to select the feature genes, a Bayesian statistical classification model is built, and Markov chain Monte Carlo (MCMC) sampling is used to compute posterior classification probabilities. The model is applied to two real DNA microarray data sets, where it classifies the samples successfully.

11.
This paper describes a method for periodic subject-related search based on a composition of keyword search and subject-related filtering using text classifiers. We consider various classification algorithms from the standpoint of their efficiency in solving the problem under study.

12.
New challenges in knowledge extraction include interpreting and classifying data sets while simultaneously considering related information to confirm results or identify false positives. We discuss a data fusion algorithmic framework targeted at this problem. It includes separate base classifiers for each data type and a fusion method for combining the individual classifiers. The fusion method is an extension of current ensemble classification techniques and has the advantage of allowing data to remain in heterogeneous databases. In this paper, we focus on the applicability of such a framework to the protein phosphorylation prediction problem.

13.
Supervised learning methods are powerful techniques to learn a function from a given set of labeled data, the so-called training data. In this paper the support vector machines approach is applied to an image classification task. Starting with the corresponding Tikhonov regularization problem, reformulated as a convex optimization problem, we introduce a conjugate dual problem to it and prove that, whenever strong duality holds, the function to be learned can be expressed via the dual optimal solutions. Corresponding dual problems are then derived for different loss functions. The theoretical results are applied by numerically solving a classification task using high dimensional real-world data in order to obtain optimal classifiers. The results demonstrate the excellent performance of support vector classification for this particular problem.

14.
Liu Xiao, Wang Xiaoli. Operations Research and Management Science, 2021, 30(3): 104-111
Classifying customer value and identifying high-value customers is vital to an airline's profitability. This paper proposes an airline customer value classification model based on k-means and neighborhood rough sets. First, a comprehensive customer value evaluation index system is built from the dual perspectives of current value and potential value. Next, customers are clustered with an elbow-based k-means method, the decision system's indices are reduced with neighborhood rough sets, and an initial screening of customer value is performed on the reduced system. Before evaluation, SMOTE is applied to remove class imbalance, after which a grid search over combined classifiers evaluates and validates the value classification. Finally, customer value is segmented according to the evaluation results. An empirical analysis of 62,988 real customer records from a domestic Chinese airline achieves 92% classification accuracy for the potential-VIP segment, offering a new approach to airline customer value classification.
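The SMOTE step used above generates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. A minimal sketch of that interpolation (function and parameter names are hypothetical, and real implementations add stratification and distance caching):

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(nbrs)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because new points interpolate existing minority samples, they stay inside the minority region rather than duplicating points, which is why SMOTE balances classes without simple over-copying.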

15.
Social media, such as blogs and on-line forums, contain a huge amount of information that is typically unorganized and fragmented. An important issue, which has been gaining importance, is to classify on-line texts in order to detect possible anomalies. For example, on-line texts representing consumer opinions can be very valuable and profitable for companies, but can also cause serious damage if they are negative or faked. In this contribution we present a novel statistical methodology, rooted in the context of classical text classification, to address such issues. In the literature, several classifiers have been proposed, among them support vector machine and naive Bayes classifiers. These approaches are not effective when coping with the problem of classifying texts belonging to an unknown author. To this aim, we propose a new method based on the combination of classification trees with nonparametric approaches, such as the Kruskal–Wallis and Brunner–Dette–Munk tests. The main application of what we propose is the capability to classify an author as a new one, who is potentially trustworthy, or as an old one, who is potentially faking.

16.
Mathematical programming (MP) discriminant analysis models are widely used to generate linear discriminant functions that can be adopted as classification models. Nonlinear classification models may have better classification performance than linear classifiers, but although MP methods can be used to generate nonlinear discriminant functions, functions of specified form must be evaluated separately. Piecewise-linear functions can approximate nonlinear functions, and two new MP methods for generating piecewise-linear discriminant functions are developed in this paper. The first method uses maximization of classification accuracy (MCA) as the objective, while the second uses an approach based on minimization of the sum of deviations (MSD). The use of these new MP models is illustrated in an application to a test problem and the results are compared with those from standard MCA and MSD models.

17.
The Mumford-Shah energy functional is a successful image segmentation model. It is a non-convex variational problem, and good initialization techniques for it are still lacking. In this paper, motivated by the fact that an image histogram is a combination of several Gaussian distributions whose centers can be considered approximations of cluster centers, we introduce a histogram-based initialization method to compute the cluster centers. With this technique, we then devise an effective multi-region Mumford-Shah image segmentation method, and adopt the recent proximal alternating minimization method to solve the minimization problem. Experiments indicate that our histogram initialization method is more robust than existing methods, and our segmentation method is very effective for both gray and color images.
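The histogram-based initialization can be approximated by taking local maxima of the intensity histogram as cluster-center estimates. The sketch below picks raw histogram peaks rather than fitting Gaussian components as the paper does, so it is a simplified illustration of the idea:

```python
def histogram_peaks(values, bins=16):
    """Approximate cluster centers as local maxima of the value histogram,
    mirroring the idea that histogram modes approximate cluster centers."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    peaks = []
    for i in range(bins):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < bins - 1 else -1
        if counts[i] > left and counts[i] > right:
            peaks.append(lo + (i + 0.5) * width)  # bin midpoint
    return peaks
```

Starting the multi-region segmentation from such data-driven centers, instead of arbitrary ones, is what makes the non-convex minimization robust to initialization.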

18.
The use of boxes for pattern classification has been widespread and is a fairly natural way in which to partition data into different classes or categories. In this paper we consider multi-category classifiers which are based on unions of boxes. The classification method studied may be described as follows: find boxes such that all points in the region enclosed by each box are assumed to belong to the same category, and then classify remaining points by considering their distances to these boxes, assigning to a point the category of the nearest box. This extends the simple method of classifying by unions of boxes by incorporating a natural, proximity-based way of classifying points outside the boxes. We analyze the generalization accuracy of such classifiers and obtain generalization error bounds that depend on a measure of how definitive the classification of the training points is.
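The classification rule described — points inside a box take its label, points outside take the label of the nearest box — can be sketched directly, since the distance from a point to an axis-aligned box has a simple closed form:

```python
def box_distance(point, box):
    """Euclidean distance from a point to an axis-aligned box (0 if inside).

    box: (lows, highs), one bound per coordinate.
    """
    lows, highs = box
    d2 = 0.0
    for p, lo, hi in zip(point, lows, highs):
        if p < lo:
            d2 += (lo - p) ** 2     # below the box in this coordinate
        elif p > hi:
            d2 += (p - hi) ** 2     # above the box in this coordinate
    return d2 ** 0.5

def classify(point, labeled_boxes):
    """Assign the label of the nearest box; interior points get distance 0."""
    return min(labeled_boxes, key=lambda lb: box_distance(point, lb[1]))[0]
```

Only coordinates that fall outside the box's interval contribute to the distance, which is why interior points automatically receive their enclosing box's label.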

19.
In this work we address a technique for effectively clustering points into specific convex sets, called homogeneous boxes, having sides aligned with the coordinate axes (the isothetic condition). The proposed clustering approach is based on homogeneity conditions rather than a distance measure, and, although it was originally developed in the context of the logical analysis of data, it is now placed inside the framework of supervised clustering. First, we introduce the basic concepts of box geometry; then, we consider a generalized clustering algorithm based on a class of graphs called incompatibility graphs. For supervised classification problems, we consider classifiers based on box sets, and compare their overall performance to the accuracy levels of competing methods on a wide range of real data sets. The results show that the proposed method performs comparably with other supervised learning methods in terms of accuracy.

20.
Efficiently maintaining the partition induced by a set of features is an important problem in building decision‐tree classifiers. In order to identify a small set of discriminating features, we need the capability of efficiently adding and removing specific features and determining the effect of these changes on the induced classification or partition. In this paper we introduce a variety of randomized and deterministic data structures to support these operations on both general and geometrically induced set partitions. We give both Monte Carlo and Las Vegas data structures that realize near‐optimal time bounds and are practical to implement. We then provide a faster solution to this problem in the geometric setting. Finally, we present a data structure that efficiently estimates the number of partitions separating elements. © 2004 Wiley Periodicals, Inc. Random Struct. Alg., 2004
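The partition induced by a set of boolean features can be illustrated with a recompute-from-scratch baseline: two elements fall in the same block exactly when they agree on every feature. The paper's contribution is data structures that support adding and removing features faster than this naive baseline, which is shown only to fix the semantics:

```python
def partition(elements, features):
    """Partition elements by their signature under a set of boolean features.

    features: dict name -> predicate; elements in the same block agree
    on every feature.
    """
    blocks = {}
    for e in elements:
        sig = tuple(bool(f(e)) for f in features.values())
        blocks.setdefault(sig, []).append(e)
    return list(blocks.values())

def add_feature(features, name, pred):
    """Adding a feature can only split existing blocks, never merge them."""
    features[name] = pred
    return features

def remove_feature(features, name):
    """Removing a feature can only merge blocks, never split them."""
    del features[name]
    return features
```

Each add or remove here forces a full O(elements × features) recomputation; the randomized structures in the paper maintain the same partition under these operations in near-optimal time.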
