Similar Documents
20 similar documents found (search time: 171 ms)
1.
张军  王素格 《计算机科学》2016,43(7):234-239
Cross-domain text sentiment classification has become a research hotspot in natural language processing. Traditional active learning cannot exploit correlations between domains, and the bag-of-words model cannot filter out words irrelevant to sentiment classification. To address this, a cross-domain text sentiment classification method based on stepwise refinement of the classification model is proposed. First, sentiment words common to the source and target domains are selected as features; a classifier is trained on the source domain and used to produce initial labels for the target domain, and high-confidence texts are chosen as the classifier's initial seed samples. To speed up refinement of the target-domain classifier, at each iteration the low-confidence texts are sent to experts for annotation, and the annotated results are added to the training set together with the high-confidence texts; the feature set is then extracted dynamically from the training set according to a sentiment lexicon, extraction rules for opinion-word collocations, and auxiliary feature words. Experimental results show that the method not only effectively improves cross-domain sentiment classification but also reduces the cost of manual annotation to a certain extent.
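The seed-selection step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the lexicon-based scorer `polarity_score` is a hypothetical stand-in for the source-domain classifier, and the thresholds are made up.

```python
def polarity_score(doc, pos_words, neg_words):
    """Toy stand-in for the source-domain classifier: P(positive)
    estimated from counts of shared sentiment words."""
    pos = sum(w in pos_words for w in doc)
    neg = sum(w in neg_words for w in doc)
    total = pos + neg
    return 0.5 if total == 0 else pos / total

def split_by_confidence(docs, pos_words, neg_words, hi=0.8, lo=0.6):
    """High-confidence target-domain docs become pseudo-labelled seeds;
    the least confident ones are routed to a human annotator."""
    seeds, to_annotate = [], []
    for doc in docs:
        p = polarity_score(doc, pos_words, neg_words)
        conf = max(p, 1 - p)
        if conf >= hi:
            seeds.append((doc, int(p >= 0.5)))   # pseudo-label as seed
        elif conf < lo:
            to_annotate.append(doc)              # ask the expert
    return seeds, to_annotate

pos_words = {"good", "great"}
neg_words = {"bad", "awful"}
docs = [["good", "great", "plot"], ["bad", "awful", "sound"], ["plot", "sound"]]
seeds, to_annotate = split_by_confidence(docs, pos_words, neg_words)
```

Each iteration would retrain on `seeds` plus the newly annotated texts and repeat the split.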

2.
梁喜涛  顾磊 《计算机科学》2015,42(6):228-232, 261
Word segmentation is a key foundational technique in Chinese natural language processing. To address the shortage of training samples and the time and labor cost of obtaining large numbers of annotated samples, an active-learning segmentation method based on the nearest-neighbor rule is proposed. A newly proposed selection strategy picks the most valuable samples from a large pool of unlabeled data for annotation; the annotated samples are then added to the training set, which is used to train the segmenter. Tests on the PKU, MSR, and Shanxi University datasets compare the method against the traditional uncertainty-based selection strategy. Experimental results show that the proposed nearest-neighbor active learning method selects more valuable samples, effectively reducing the cost of manual annotation while also improving segmentation accuracy.
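One simple nearest-neighbor selection rule is to favor unlabeled points that lie far from every labeled point, so new annotations cover unexplored regions. This sketch is only one plausible reading of such a rule; the paper's exact strategy and feature space are not reproduced here.

```python
import math

def nearest_neighbor_select(unlabeled, labeled, k=1):
    """Return the k unlabeled points whose distance to their nearest
    labelled neighbour is largest (most novel w.r.t. the labelled set)."""
    scored = [
        (min(math.dist(u, l) for l in labeled), u)
        for u in unlabeled
    ]
    scored.sort(key=lambda t: -t[0])   # farthest-first
    return [u for _, u in scored[:k]]

labeled = [(0.0, 0.0), (1.0, 0.0)]
unlabeled = [(0.5, 0.1), (3.0, 3.0), (1.1, 0.2)]
picked = nearest_neighbor_select(unlabeled, labeled, k=1)
```

Here the isolated point `(3.0, 3.0)` is picked, while points near existing labels are skipped.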

3.
A relevance-feedback method based on a linear-programming classifier is proposed. The classifier couples feature selection with classifier learning, so that it can not only be trained in real time on the small set of samples labeled by the user, but can also select the features sensitive to the user's feedback according to each sample's contribution to classification, thereby effectively capturing the user's feedback intent under small-sample relevance-feedback training. Experiments on sketch retrieval verify the effectiveness of the proposed method for relevance feedback.

4.
王浩  孙超利  张国晨 《控制与决策》2023,38(12):3317-3326
Model management, in particular the selection of training samples and the infill-sampling criterion, is a key factor in the performance of algorithms for expensive multi-objective optimization. To this end, individuals with good objective values are selected from the sample archive to train surrogate models of the objective functions, and a reference-vector-based evolutionary algorithm searches for the surrogates' optimal solution set; a sampling strategy based on the mean rank of the uncertainty of each individual's estimated objective values is proposed to choose two individuals from that set for true objective evaluation. To verify its effectiveness, the proposed algorithm is tested on the DTLZ and WFG multi-objective benchmark problems and two real-world engineering optimization problems, and compared with five other state-of-the-art algorithms of the same type. The results show that the proposed algorithm is effective for expensive high-dimensional multi-objective optimization problems.

5.
Research on Semi-supervised Sentiment Classification Based on Ensemble Learning   (Total citations: 1; self-citations: 0; other citations: 1)
Sentiment classification is the task of classifying the sentiment polarity expressed in a text. This paper studies semi-supervised sentiment classification, i.e., improving classification performance with the help of unlabeled samples on top of a very small set of labeled samples. To strengthen semi-supervised learning, an ensemble method based on consistent labels is proposed to fuse two mainstream semi-supervised sentiment classification approaches: co-training over random feature subspaces and label propagation. First, the classifiers trained by the two semi-supervised methods label the unlabeled samples; second, the samples whose labels agree are selected; finally, the selected samples are used to update the training model. Experimental results show that the method effectively reduces the mislabeling rate on unlabeled samples and thus achieves better classification than either semi-supervised method alone.
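The consistency-label filter described above can be sketched in a few lines. The two labeler functions below are toy stand-ins (hypothetical threshold rules) for the co-training and label-propagation models; only the agreement logic mirrors the paper.

```python
def agreed_samples(unlabeled, labeler_a, labeler_b):
    """Keep only samples whose two pseudo-labels agree; these are the
    samples added back into the training set."""
    kept = []
    for x in unlabeled:
        ya, yb = labeler_a(x), labeler_b(x)
        if ya == yb:
            kept.append((x, ya))
    return kept

# Toy stand-in labelers: threshold a 1-D sentiment score differently.
labeler_a = lambda x: int(x > 0.0)
labeler_b = lambda x: int(x > 0.2)
kept = agreed_samples([-1.0, 0.1, 0.5], labeler_a, labeler_b)
```

Samples on which the two models disagree (here the score 0.1) are left unlabeled rather than risk a mislabel.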

6.
An Active Text Annotation Method Using Nearest Neighbors and Information Entropy   (Total citations: 1; self-citations: 0; other citations: 1)
Because large-scale annotation of text data is time-consuming and laborious, semi-supervised text classification, which uses a small number of labeled samples and a large number of unlabeled ones, has developed rapidly. In semi-supervised text classification the few labeled samples are mainly used to initialize the classification model, and how well they are chosen affects the final model's performance. So that the labeled samples match the distribution of the original data as closely as possible, a method is proposed that avoids the k-nearest neighbors of already-labeled samples when drawing the next batch of candidate samples, giving samples in different regions more chances to be labeled. On this basis, to obtain more class information, the candidate with the maximum information entropy is chosen as the sample to annotate. Experiments on real text data demonstrate the effectiveness of the proposed method.
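The two-stage rule above (exclude neighbors of labeled samples, then maximize entropy) can be sketched as follows. The sample ids, probabilities, and precomputed neighbor set are all illustrative assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a predictive distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_next(candidates, labeled_ids, knn_ids, proba):
    """Stage 1: drop candidates that are labelled or sit among the
    k-nearest neighbours of labelled samples. Stage 2: among the rest,
    pick the sample with maximum predictive entropy."""
    pool = [c for c in candidates if c not in labeled_ids and c not in knn_ids]
    return max(pool, key=lambda c: entropy(proba[c]))

# Hypothetical class-probability estimates for four samples.
proba = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.6, 0.4], 3: [0.55, 0.45]}
# Sample 1 is a near neighbour of an already-labelled sample, so it is
# skipped even though its entropy is highest.
chosen = select_next([0, 1, 2, 3], labeled_ids=set(), knn_ids={1}, proba=proba)
```

Excluding the neighborhood of labeled data spreads annotations across regions; the entropy criterion then maximizes class information per label.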

7.
The goal of automatic semantic annotation of 3D models is to produce the set of tags that best describes a model; it is a key step in text-based 3D model retrieval. Because of the semantic gap, annotation obtained by similarity matching leaves room for improvement. To exploit a large number of unlabeled samples during automatic annotation, given only the limited set of models and corresponding tags supplied by users, a semi-supervised metric learning method for automatic semantic annotation of 3D models is proposed. The method first expands the labeled model set with graph-based semi-supervised learning and assigns each semantic label in the expanded set a confidence score for the model it describes; it then learns a Mahalanobis distance metric with an improved relevant component analysis method, and forms a multi-label annotation strategy from the learned distance and the confidence scores. Tests on the PSB (Princeton Shape Benchmark) dataset show that the method makes use of many unlabeled samples in the annotation process and achieves good annotation results.

8.
When labeled samples are scarce, semi-supervised learning uses large numbers of unlabeled samples to ease the annotation bottleneck. However, because the unlabeled and labeled samples may come from different domains, quality problems in the unlabeled data can weaken the model's generalization and reduce classification accuracy. Building on the wordMixup method, a u-wordMixup method is therefore proposed for data augmentation of unlabeled samples and, combined with a consistency-training framework and the Mean Teacher model, a semi-supervised deep learning model based on u-wordMixup (SD-uwM) is presented. The model augments unlabeled samples with u-wordMixup and, under the constraints of a supervised cross-entropy loss and an unsupervised consistency loss, improves the quality of unlabeled samples and reduces overfitting. Comparative experiments on the AGNews, THUCNews, and 20 Newsgroups datasets show that the proposed method improves generalization while also improving time performance.
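The core wordMixup interpolation that u-wordMixup applies to unlabeled pairs can be shown in isolation. This is a minimal sketch of the token-wise mixing step only, assuming same-length sentences and toy 2-dimensional embeddings; the full SD-uwM training loop (Mean Teacher, consistency loss) is not reproduced.

```python
def word_mixup(emb_a, emb_b, lam):
    """Mix two same-length sequences of word embeddings token by token:
    x' = lam * a + (1 - lam) * b  (the wordMixup interpolation)."""
    return [
        [lam * ai + (1 - lam) * bi for ai, bi in zip(wa, wb)]
        for wa, wb in zip(emb_a, emb_b)
    ]

emb_a = [[1.0, 0.0], [0.0, 2.0]]   # toy 2-token sentence embeddings
emb_b = [[0.0, 1.0], [2.0, 0.0]]
mixed = word_mixup(emb_a, emb_b, lam=0.5)
```

In u-wordMixup the mixed sequence would be fed to the model and constrained by the unsupervised consistency loss rather than a gold label.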

9.
Training multimodal models in deep learning usually requires large quantities of high-quality labeled data of different types, such as images, text, and audio. Obtaining large-scale multimodal labeled data, however, is challenging and expensive. Active learning is widely applied as an effective paradigm for this problem: by selecting the most informative samples for annotation, it lowers labeling cost and improves model performance. Existing active learning methods often suffer from inefficient data scanning and data-relocation problems; when an index must be updated over a wide range, the maintenance cost becomes enormous. To address these issues, this paper proposes So-CBI, an efficient sample retrieval technique for multimodal model training. The method perceives inter-class boundary points during model training to assess each sample's value to the model precisely, and designs a semi-ordered, efficient sample index that combines data-ordering information with partial orderedness to reduce index maintenance cost and time overhead. Experiments on several multimodal datasets, compared against traditional active-learning training methods, verify the effectiveness of So-CBI for the training-sample retrieval problem under active learning.

10.
张人之  朱焱 《计算机科学》2021,48(6):332-337
Detecting malicious users in social networks is a classification task that requires annotated training samples, but social networks are usually large and annotating all samples is prohibitively expensive. To find the samples most worth labeling under a limited annotation budget, while making full use of unlabeled samples to improve detection of malicious users, a detection method based on the graph neural network GraphSAGE and active learning is proposed. The method consists of a detection module and an active-learning module. Inspired by the Transformer, the detection module improves GraphSAGE by flattening its aggregation of neighbor information across orders, so that higher-order neighbors are aggregated directly into the central node, reducing the information loss of higher-order neighbors; ensemble learning then exploits the extracted representations from different angles to perform detection. The active-learning module measures the value of unlabeled samples from the ensemble classification results. During annotation the two modules are used alternately to guide the labeling process toward samples that help the classifier more. The experiments use AUROC and AUPR as evaluation metrics, verify the effectiveness of the improved detection module on a real large-scale social network dataset, analyze why the improvement works, and compare the proposed method with two existing active learning methods of the same type. The results show that, with the same number of labeled training samples, the samples selected by the proposed method yield better classification performance.

11.
刘广灿  曹宇  许家铭  徐波 《自动化学报》2019,45(8):1455-1463
Current natural language inference (NLI) models rely heavily on word-level information for inference. Although word-level discriminative information plays an important role, an inference model should attend more to the intrinsic meaning of continuous text and to linguistic expression, reasoning from a holistic grasp of sentence meaning rather than shallowly from the opposition or similarity of individual words. Moreover, conventional supervised learning makes models overly dependent on the linguistic priors of the training set while lacking an understanding of linguistic logic. To emphasize the importance of learning sentence-sequence encodings explicitly and to reduce the influence of language bias, this paper proposes a natural language inference method based on adversarial regularization. The method first introduces a word-encoding-based inference model that takes the word encodings of the standard inference model as input and can succeed only by exploiting language bias; adversarial training between the two models then prevents the standard model from relying too heavily on that bias. Experiments on the public SNLI and Breaking-NLI benchmarks show that the method achieves the best performance among existing sentence-embedding-based inference models on SNLI, with 87.60% test accuracy, and also achieves the best published result on Breaking-NLI.

12.
This paper addresses the inference of probabilistic classification models using weakly supervised learning. The main contribution of this work is the development of learning methods for training datasets consisting of groups of objects with known relative class priors. This can be regarded as a generalization of the situation addressed by Bishop and Ulusoy (2005), where training information is given as the presence or absence of object classes in each set. Generative and discriminative classification methods are conceived and compared for weakly supervised learning, as well as a non-linear version of the probabilistic discriminative models. The considered models are evaluated on standard datasets and an application to fisheries acoustics is reported. The proposed proportion-based training is demonstrated to outperform model learning based on presence/absence information and the potential of the non-linear discriminative model is shown.

13.
Traditional methods for creating diesel engine models include analytical methods, such as multi-zone models, and intelligence-based models, such as those built on artificial neural networks (ANN). However, the analytical models require excessive assumptions, while ANN models have drawbacks such as a tendency to overfit and the difficulty of determining the optimal network structure. In this paper, several emerging machine learning techniques, including least squares support vector machine (LS-SVM), relevance vector machine (RVM), basic extreme learning machine (ELM) and kernel-based ELM, are newly applied to the modelling of diesel engine performance. Experiments were carried out to collect sample data for model training and verification. Limited by the experimental conditions, only 24 sample data sets were acquired, resulting in data scarcity; six-fold cross-validation is therefore adopted to address this issue. Some of the sample data also suffer from data exponentiality, where the engine performance output grows exponentially with engine speed and engine torque, which seriously deteriorates prediction accuracy. Thus, a logarithmic transformation of the dependent variables is used to pre-process the data. Besides, a hybrid of leave-one-out cross-validation and Bayesian inference is, for the first time, proposed for selecting the hyperparameters of kernel-based ELM. A comparison among the advanced machine learning techniques, along with two traditional types of ANN models, namely the back propagation neural network (BPNN) and the radial basis function neural network (RBFNN), is conducted. The models are evaluated on time complexity, space complexity, and prediction accuracy. The evaluation results show that kernel-based ELM with the logarithmic transformation and hybrid inference is far better than basic ELM, LS-SVM, RVM, BPNN and RBFNN in terms of prediction accuracy and training time.
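The logarithmic pre-processing step is straightforward to illustrate: exponentially growing targets become (approximately) linear in log space, and predictions are inverted with `exp`. This is a sketch of the transformation only, with made-up values, not the paper's engine data.

```python
import math

def log_transform(y):
    """Log-transform exponentially growing targets before model fitting."""
    return [math.log(v) for v in y]

def inverse_transform(y_log):
    """Map model predictions back to the original scale."""
    return [math.exp(v) for v in y_log]

y = [1.0, 10.0, 100.0]      # toy output growing exponentially
y_log = log_transform(y)    # roughly [0.0, 2.30, 4.61] -- now linear
```

A model fitted on `y_log` faces evenly spaced targets instead of values spanning two orders of magnitude.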

14.
Model selection – choosing the relevant variables and structures – is a central task in econometrics. Given a limited number of observations, estimation and inference depend on this choice. A frequently treated model-selection problem arises in multivariate autoregressive models, where the problem reduces to the choice of a dynamic structure. In most applications this choice is based either on some ad hoc procedure or on a search within a very small subset of all possible models. In this paper the selection is performed using an explicit optimization approach for a given information criterion. Since complete enumeration of all possible lag structures is infeasible even for moderate dimensions, the global optimization heuristic of threshold accepting is implemented. A simulation study compares this approach with the standard `take all up to the kth lag' approach. It is found that, if the lag structure of the true model is sparse, the threshold accepting optimization approach gives far better approximations.
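The threshold-accepting heuristic named above is a deterministic cousin of simulated annealing: a candidate move is accepted whenever it worsens the objective by less than the current threshold, which shrinks over the run. The sketch below applies a generic threshold-accepting loop to a toy binary lag-mask selection; the objective, neighborhood move, and threshold schedule are illustrative assumptions, not the paper's setup.

```python
import random

def threshold_accepting(init, objective, neighbour, thresholds, seed=0):
    """Generic threshold-accepting search (minimization): accept any
    neighbour that is not worse than the current point by more than the
    current threshold tau; track the best point seen."""
    rng = random.Random(seed)
    x, fx = init, objective(init)
    best, fbest = x, fx
    for tau in thresholds:
        cand = neighbour(x, rng)
        fc = objective(cand)
        if fc - fx < tau:           # accept if not much worse
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest

# Toy information-criterion-like objective over a 4-bit lag mask.
target = (1, 0, 1, 0)

def objective(mask):
    return sum(a != b for a, b in zip(mask, target))

def flip_one(mask, rng):
    i = rng.randrange(len(mask))
    return mask[:i] + (1 - mask[i],) + mask[i + 1:]

best, fbest = threshold_accepting(
    (0, 0, 0, 0), objective, flip_one,
    thresholds=[2] * 20 + [1] * 20 + [0] * 20,
)
```

A real application would replace `objective` with an information criterion evaluated on the fitted VAR for each lag structure.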

15.
Judging by the increasing impact of machine learning on large-scale data analysis in the last decade, one can anticipate a substantial growth in diversity of the machine learning applications for “big data” over the next decade. This exciting new opportunity, however, also raises many challenges. One of them is scaling inference within and training of graphical models. Typical ways to address this scaling issue are inference by approximate message passing, stochastic gradients, and MapReduce, among others. Often, we encounter inference and training problems with symmetries and redundancies in the graph structure. A prominent example are relational models that capture complexity. Exploiting these symmetries, however, has not been considered for scaling yet. In this paper, we show that inference and training can indeed benefit from exploiting symmetries. Specifically, we show that (loopy) belief propagation (BP) can be lifted. That is, a model is compressed by grouping nodes together that send and receive identical messages so that a modified BP running on the lifted graph yields the same marginals as BP on the original one, but often in a fraction of time. By establishing a link between lifting and radix sort, we show that lifting is MapReduce-able. Still, in many if not most situations training relational models will not benefit from this (scalable) lifting: symmetries within models easily break since variables become correlated by virtue of depending asymmetrically on evidence. An appealing idea for such situations is to train and recombine local models. This breaks long-range dependencies and allows to exploit lifting within and across the local training tasks. Moreover, it naturally paves the way for the first scalable lifted training approaches based on stochastic gradients, both in an online and a MapReduced fashion. 
On several datasets, the online training, for instance, converges to the same quality solution over an order of magnitude faster, simply because it starts optimizing long before having seen the entire mega-example even once.

16.
In this paper, fuzzy inference models for pattern classification are developed and fuzzy inference networks based on these models are proposed. Most existing fuzzy rule-based systems have difficulty deriving inference rules and membership functions directly from training data, so rules and membership functions are obtained from experts. Some approaches use backpropagation (BP) type learning algorithms to learn the parameters of membership functions from training data. However, BP algorithms take a long time to converge and require the number of inference rules to be set in advance, a task that demands considerable experience from the designer. In this paper, self-organizing learning algorithms are proposed for the fuzzy inference networks, in which the number of inference rules and the membership functions in the rules are determined automatically during training, and learning is fast. The proposed fuzzy inference network (FIN) classifiers possess both the structure and learning ability of neural networks and the fuzzy classification ability of fuzzy algorithms. Simulation results on fuzzy classification of two-dimensional data are presented and compared with those of the fuzzy ARTMAP; the proposed fuzzy inference networks perform better than the fuzzy ARTMAP and need fewer training samples.

17.
The practicality and effectiveness of the Adaboost algorithm have long been established in machine learning. However, the algorithm was originally designed for classification problems, so it cannot be applied directly to research problems in recommender systems, where it has seen relatively little study. This paper improves Adaboost by introducing a threshold that converts rating prediction into a classification problem and trains models with Adaboost's weight-update idea, yielding a framework for rating prediction in which multiple trained models are ensembled into a final prediction, improving accuracy. With matrix factorization as the base model, experimental results show that the framework effectively improves prediction accuracy.

18.
In recent years machine learning has developed rapidly and has been widely applied in natural language processing, image recognition, search and recommendation, and other fields. However, the many openly deployed machine learning models face severe challenges in model security and data privacy. This paper focuses on membership inference attacks against black-box machine learning models: given a data record and black-box prediction access to a machine learning model, determine whether the record belongs to the model's training set....

19.
In this paper we present the results of a simulation study to explore the ability of Bayesian parametric and nonparametric models to provide an adequate fit to count data of the type that would routinely be analyzed parametrically either through fixed-effects or random-effects Poisson models. The context of the study is a randomized controlled trial with two groups (treatment and control). Our nonparametric approach uses several modeling formulations based on Dirichlet process priors. We find that the nonparametric models are able to flexibly adapt to the data, to offer rich posterior inference, and to provide, in a variety of settings, more accurate predictive inference than parametric models.

20.
For keyword-based image retrieval, learning a binary classifier from the visual similarity of the retrieved results promises to be one of the most effective ways to improve them. To improve search-engine results, this paper proposes an algorithmic framework and, within it, focuses on the key problem of training-data selection. Selection proceeds in two stages: 1) initializing the training data to start classifier learning, and 2) dynamic data selection during iterative classifier learning. For initial training-data selection we explore clustering-based and ranking-based methods, and compare automatic selection with manual annotation. For dynamic selection we compare a support vector machine with a Bayesian classifier based on maximum-minimum posterior pseudo-probability. Combining the different methods of the two stages yields eight algorithms, which are applied to keyword-based image retrieval on the Google search engine. The experiments demonstrate that how training data are selected from noisy search results is the key to improving them, and show that our methods effectively improve Google's results, especially the top-ranked ones; presenting more relevant results earlier spares users much of the work of paging through results one by one. How to make automatic training-data selection rival manual annotation remains a question for further study.


Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23
