Similar Documents
20 similar documents found.
1.
Macro-feature (i.e., document-level feature) extraction is a typical class of feature extraction methods in text classification and can be divided into supervised and unsupervised macro-feature extraction. Both kinds can improve text classification performance, but their joint use had not been studied. This paper investigates the effect of fusing supervised and unsupervised macro-features on text classification performance. Specifically, it examines the fusion of two supervised macro-feature extraction methods with three unsupervised ones: K-means, LDA, and DBN. Comparative experiments on two public corpora, Reuters-21578 and 20-Newsgroups, and one automatically constructed corpus show that fusing supervised and unsupervised macro-features is more effective for text classification than using either type alone.

2.
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language side of a comparable corpus, and the translations are used as queries for information retrieval over the target-language side. Simple filters then score the SMT output against each IR-returned sentence, with the filter score defining the degree of similarity between the two. Using the SMT output also makes it possible to correct one of the common errors through sentence tail removal. The approach was applied to Arabic-English and French-English systems using comparable news corpora, and considerable improvements in BLEU score were achieved. We show that the approach is independent of the quality of the SMT system used to generate the queries, strengthening the claim that it applies to languages and domains with little parallel data to start with. Compared with one of the earlier approaches, ours is easier to implement and gives equally good improvements.
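The mining pipeline described above can be sketched as follows. The Dice-style overlap score and the tail-removal heuristic are illustrative assumptions, since the abstract does not specify the exact filters used:

```python
def filter_score(hyp_tokens, ref_tokens):
    """Dice-style overlap between an SMT translation and an IR-returned sentence."""
    a, b = set(hyp_tokens), set(ref_tokens)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def remove_tail(tokens, ref_tokens):
    """Crude sentence-tail removal: drop trailing tokens absent from the retrieved sentence."""
    ref = set(ref_tokens)
    out = list(tokens)
    while out and out[-1] not in ref:
        out.pop()
    return out

def mine_pairs(translated_sources, retrieved_targets, threshold=0.5):
    """Keep (translation, retrieved sentence) pairs whose filter score clears a threshold."""
    pairs = []
    for hyp, ref in zip(translated_sources, retrieved_targets):
        hyp = remove_tail(hyp, ref)
        if filter_score(hyp, ref) >= threshold:
            pairs.append((hyp, ref))
    return pairs
```

In a real system the score would likely combine several filters (length ratio, lexical translation probabilities), but the thresholding structure stays the same.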

3.
Training a sequence-to-sequence (seq2seq) text simplification model requires a large-scale parallel corpus, but large, well-annotated parallel corpora are hard to obtain. This paper therefore proposes an unsupervised text simplification method in which the model learns only from unlabeled corpora of complex and simple sentences. First, denoising autoencoders are trained separately on the simple-sentence corpus and the complex-sentence corpus. Then the two autoencoders are combined to form an initial text simplification model and a text complication model. Finally, a back-translation strategy converts the unsupervised simplification problem into a supervised one, iteratively refining the simplification model. Experiments on standard datasets show that the method outperforms existing unsupervised models on the common BLEU and SARI metrics and simplifies text at both the lexical and syntactic levels.

4.
5.
This paper approaches the relation classification problem in an information extraction framework with different machine learning strategies, from strictly supervised to weakly supervised. A number of learning algorithms are presented and empirically evaluated on a standard data set. We show that a supervised SVM classifier using various lexical and syntactic features can achieve competitive classification accuracy. Furthermore, a variety of weakly supervised learning algorithms can be applied to take advantage of large amounts of unlabeled data when labeling is expensive. Newly introduced random-subspace-based algorithms demonstrate an empirical advantage over competitors in the context of both active learning and bootstrapping.
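The random-subspace idea (train an ensemble of base learners, each on a random subset of the feature dimensions, and combine them by vote) can be sketched as follows. The nearest-centroid base learner is an illustrative stand-in, not the classifier used in the paper:

```python
import random

def train_subspace_ensemble(X, y, n_models=10, subspace=2, seed=0):
    """Train n_models nearest-centroid classifiers, each on a random feature subset."""
    rng = random.Random(seed)
    dims = len(X[0])
    models = []
    for _ in range(n_models):
        feats = rng.sample(range(dims), subspace)
        # Group projected points by class, then average into centroids.
        by_class = {}
        for xi, yi in zip(X, y):
            by_class.setdefault(yi, []).append([xi[f] for f in feats])
        centroids = {c: [sum(col) / len(col) for col in zip(*pts)]
                     for c, pts in by_class.items()}
        models.append((feats, centroids))
    return models

def predict(models, x):
    """Majority vote over the ensemble's nearest-centroid decisions."""
    votes = {}
    for feats, centroids in models:
        xv = [x[f] for f in feats]
        label = min(centroids,
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(xv, centroids[c])))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

Because each member sees a different projection of the data, the ensemble tends to be more robust than any single model trained on all features, which is what makes the idea attractive for bootstrapping and active learning.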

6.
To address the poor quality of Mongolian-Chinese machine translation caused by scarce parallel corpora and the limited coverage of existing parallel data, translation is modeled here with cross-lingual multi-task learning. In the data preprocessing stage, two new unsupervised pre-training methods and one supervised pre-training method are introduced for cross-lingual modeling to learn cross-lingual representations, and the effect of the three pre-training methods on Mongolian-Chinese translation is studied. Experimental results show that all three cross-lingually pre-trained models significantly reduce the perplexity of the low-resource language and improve Mongolian-Chinese translation quality.

7.
Sentiment analysis is an active research area today owing to the abundance of opinionated data on online social networks. Sentiment orientation detection is a sub-task of sentiment analysis that identifies the polarity of a text. Many sentiment applications rely on lexicons to supply features to a model, and various machine learning algorithms and sentiment lexicons have been proposed to improve sentiment categorization. Supervised machine learning algorithms and domain-specific sentiment lexicons generally perform better than unsupervised or semi-supervised approaches based on domain-independent lexicons. The core obstacle to applying supervised algorithms or domain-specific lexicons is the unavailability of sentiment-labeled training datasets for every domain; on the other hand, the performance of algorithms based on general-purpose sentiment lexicons needs improvement. This research focuses on building a general-purpose sentiment lexicon in a semi-supervised manner. The proposed lexicon defines word semantics based on an Expected Likelihood Estimate smoothed odds ratio, which is then incorporated with a supervised machine-learning-based model selection approach. A comprehensive performance comparison verifies the superiority of the proposed approach.

8.
Feature selection (FS) is one of the most important fields in pattern recognition; it aims to pick a subset of relevant and informative features from an original feature set. FS algorithms fall into two kinds depending on whether information about dataset class labels is present: supervised and unsupervised. Supervised approaches use the dataset's class labels during feature selection; unsupervised algorithms act in their absence, which makes their task harder. In this paper, we propose unsupervised probabilistic feature selection using ant colony optimization (UPFS). The algorithm searches for the optimal feature subset in an iterative process, using inter-feature information, i.e., the similarity between features, to reduce redundancy in the final set. In each step of the ACO algorithm, to select the next potential feature, we calculate the redundancy between the current feature and all those selected so far. In addition, a pheromone matrix records the rate of co-presence of every pair of features in solutions. Features are then ranked by a probability function extracted from the matrix, and the top m are returned as the final solution. We compare the performance of UPFS with 15 well-known supervised and unsupervised feature selection methods using different classifiers (support vector machine, naive Bayes, and k-nearest neighbor) on 10 well-known datasets. The experimental results show the efficiency of the proposed method compared to previous related methods.
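The redundancy term at the heart of UPFS can be illustrated with a simplified greedy analogue, using absolute Pearson correlation as the inter-feature similarity. The full ACO machinery (pheromone matrix, probabilistic ranking) is deliberately omitted, and seeding with feature 0 is an arbitrary choice for the sketch:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_low_redundancy(features, m):
    """Greedily pick m feature indices, each minimizing its mean absolute
    correlation with the features already selected (seeded with feature 0)."""
    selected = [0]
    while len(selected) < m:
        candidates = [i for i in range(len(features)) if i not in selected]
        best = min(
            candidates,
            key=lambda i: sum(abs(pearson(features[i], features[j]))
                              for j in selected) / len(selected),
        )
        selected.append(best)
    return selected
```

A duplicated feature has correlation 1.0 with its copy, so the greedy step skips it in favor of a less redundant candidate, which is exactly the behavior the redundancy term is meant to induce.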

9.
Annotating named entity recognition (NER) training corpora is a costly but necessary process for supervised NER approaches. This paper presents a general framework for generating large-scale NER training data from parallel corpora. In our method, we first run a high-performance NER system on one side of a bilingual corpus. Then, we project the named entity (NE) labels to the other side according to the word-level alignments. Finally, we propose several strategies to select high-quality auto-labeled NER training data. We apply the approach to Chinese NER using an English-Chinese parallel corpus. Experimental results show that it collects high-quality labeled data and helps improve Chinese NER.
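The projection step can be sketched as follows: copy each source token's entity type across the word alignment, then repair the BIO scheme on the target side. Leaving unaligned target tokens as O is an illustrative simplification, not necessarily what the paper does:

```python
def project_ne_labels(src_labels, alignments, tgt_len):
    """Project BIO entity labels from source to target via word alignments.

    src_labels: BIO tags for the source sentence, e.g. ["B-PER", "O", ...]
    alignments: (src_index, tgt_index) pairs from a word aligner
    tgt_len:    number of target tokens
    """
    # 1. Copy entity types across alignment links; unaligned tokens stay O.
    tgt_type = [None] * tgt_len
    for s, t in alignments:
        label = src_labels[s]
        if label != "O":
            tgt_type[t] = label.split("-", 1)[1]
    # 2. Re-emit a well-formed BIO sequence on the target side.
    projected, prev = [], None
    for typ in tgt_type:
        if typ is None:
            projected.append("O")
        elif typ == prev:
            projected.append("I-" + typ)
        else:
            projected.append("B-" + typ)
        prev = typ
    return projected
```

The re-BIO pass matters because alignments can reorder tokens, so naively copying B-/I- prefixes would produce ill-formed tag sequences on the target side.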

10.
During the development of a domain-specific Chinese-English machine translation system, the appearance of many new words degrades Chinese word segmentation accuracy, while the lack of annotated in-domain corpora limits the performance of supervised learning techniques. The resulting errors in the extracted translation knowledge seriously harm translation quality. To address this, this paper implements a domain-adaptive segmentation model trained on raw (unannotated) text together with bilingually guided Chinese word segmentation, and proposes a method for fusing multiple segmentation results: a lattice is constructed and a dynamic programming algorithm finds the best Chinese segmentation. To validate the proposed method, evaluation experiments were conducted on the NTCIR-10 Chinese-English dataset. The results show that the proposed fusion method improves both segmentation F-score and the BLEU score of statistical machine translation.
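The fusion step (build a lattice of candidate words from several segmenters' outputs, then find the best path by dynamic programming) can be sketched as follows, with simple vote counts standing in for whatever edge scores the paper actually uses:

```python
from collections import defaultdict

def fuse_segmentations(segs):
    """Fuse several segmentations of the same sentence via a word lattice.

    segs: list of segmentations, each a list of words concatenating to the
    same sentence. Edge score = number of segmenters proposing that word
    span; DP over character positions finds the highest-scoring path.
    """
    sentence = "".join(segs[0])
    n = len(sentence)
    # Build the lattice: votes per (start, end) word span.
    votes = defaultdict(int)
    for seg in segs:
        pos = 0
        for word in seg:
            votes[(pos, pos + len(word))] += 1
            pos += len(word)
    edges = defaultdict(list)  # start position -> [(end position, score)]
    for (s, e), v in votes.items():
        edges[s].append((e, v))
    # Dynamic program for the best-scoring path through the lattice.
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0
    for i in range(n):
        if best[i] == float("-inf"):
            continue
        for e, v in edges[i]:
            if best[i] + v > best[e]:
                best[e] = best[i] + v
                back[e] = i
    # Recover the best segmentation by walking the backpointers.
    words, i = [], n
    while i > 0:
        s = back[i]
        words.append(sentence[s:i])
        i = s
    return words[::-1]
```

Spans that several segmenters agree on accumulate votes, so the DP naturally prefers consensus boundaries while still allowing a single segmenter to win where the others disagree with each other.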

11.
Exploratory analysis is an area of increasing interest in the computational linguistics arena. Pragmatically speaking, exploratory analysis may be paraphrased as natural language processing by means of analyzing large corpora of text. Appropriate means for such analysis are statistics on the one hand and artificial neural networks on the other. Text databases, be they information retrieval or information filtering systems, are certainly a challenging application area for exploratory analysis of text corpora. In this paper we present recent findings of exploratory analysis based on both statistical and neural models applied to legal text corpora. For the artificial neural networks, we rely on a model adhering to the unsupervised learning paradigm. This choice arises naturally from the specific properties of large text corpora: the input-output mappings required by supervised learning models cannot be provided beforehand to a satisfying extent, because the contents of text archives change constantly. Artificial neural networks are also valued for their robustness with respect to the parameters of model optimization; in particular, we found statistical classification techniques much more susceptible to minor parameter variations than unsupervised artificial neural networks. We describe two lines of research in exploratory analysis. First, we use the classification methods for concept analysis, where the general goal is to uncover the different meanings of one and the same natural language concept, a task that is of specific importance during the creation of thesauri. As a convenient environment to present the results we selected the legal term of neutrality, a perfect representative of a concept with a number of highly divergent meanings.
Second, we describe the classification methods in the setting of document classification. The ultimate goal in such an application is to uncover semantic similarities between text documents in order to increase the efficiency of an information retrieval system. In this sense, document classification has held a fixed position in information retrieval research from the very beginning; renewed massive interest in it may be witnessed today due to the appearance of large-scale digital libraries.

12.
Following the rapid growth of available collections of medical images in recent years, both in number and in size, machine learning has become an important tool for solving various image-analysis problems, such as organ segmentation or injury/pathology detection. The potential of learning algorithms to produce models with good generalisation properties is highly dependent on model complexity and the amount of available data. Bearing in mind that complex concepts require complex models, it is of paramount importance to mitigate representation complexity where possible, thereby enabling simpler models to perform the same task. When dealing with image collections of quasi-symmetric organs, or imaging observations of organs taken from different quasi-symmetric perspectives, one way of reducing representation complexity is to align all the images in a collection for left-right or front-rear orientation, so that a learning algorithm need not learn redundant mirror-image representations. In this paper, we study in detail the influence of such within-class variation on model complexity and present a possible solution applicable to medical-imaging computer-aided diagnosis systems. The proposed method involves compacting the data, extracting features, and then learning to separate the mirror-image representation classes from one another. Two efficient approaches are considered for performing such orientation separation: a fully automated unsupervised approach and a semi-automated supervised approach. Both solutions are directly applicable to imaging data. Method performance is illustrated on two 2D and one 3D real-world, publicly available medical datasets, concerning different parts of human anatomy and observed using different imaging techniques: colour fundus photography, mammography scans, and volumetric knee-joint MR scans.
Experimental results suggest that efficient organ-mirroring orientation-classifier models, with expected classification accuracy greater than 99%, can be estimated using either the unsupervised or the supervised approach. In the presence of noise, however, an equally good performance can be achieved only by the supervised approach, learning from a small subset of labelled data.

13.
Neural machine translation performs well on tasks with plentiful parallel corpora, but often poorly on low-resource language pairs. No large-scale Chinese-Vietnamese parallel corpus exists. For this translation task, this paper explores using only easily obtained monolingual Chinese and Vietnamese corpora, mining word-level cross-lingual information from the monolingual data and incorporating it into an unsupervised translation model to improve performance. The paper proposes a Chinese-Vietnamese unsupervised neural machine translation method that incorporates a bilingual dictionary built by minimizing EMD (Earth Mover's Distance). First, monolingual word embeddings are trained separately for Chinese and Vietnamese; a Chinese-Vietnamese bilingual dictionary is obtained by minimizing the EMD between them. This dictionary then serves as a seed dictionary for training Chinese-Vietnamese bilingual word embeddings. Finally, an unsupervised machine translation model with a shared encoder is used to build the Chinese-Vietnamese unsupervised NMT system. Experiments show that the method effectively improves the performance of Chinese-Vietnamese unsupervised neural machine translation.

14.
Information extraction from unstructured, ungrammatical data such as classified listings is difficult because traditional structural and grammatical extraction methods do not apply. Previous work has exploited reference sets to aid such extraction, but did so using supervised machine learning. In this paper, we present an unsupervised approach that both selects the relevant reference set(s) automatically and then uses it for unsupervised extraction. We validate our approach with experimental results showing that our unsupervised extraction is competitive with supervised machine learning approaches, including the previous supervised approach that exploits reference sets.

15.
Sentiment analysis involves detecting the sentiment content of text using natural language processing, a challenging task owing to syntactic ambiguity, named entity recognition, slang, jargon, sarcasm, abbreviations, and contextual sensitivity. Sentiment analysis can be performed with supervised as well as unsupervised approaches. As the amount of data grows, unsupervised approaches become vital, since they cut down on learning time and on the need for a labelled dataset. Sentiment lexicons make unsupervised algorithms easy to apply to text classification. SentiWordNet is a lexical resource widely employed for sentiment analysis and polarity classification, but the reported performance levels need improvement. The proposed research focuses on raising the performance of SentiWordNet 3.0 by using it as a labelled corpus to build another sentiment lexicon, named Senti-CS. Part-of-speech information, usage-based ranks, and sentiment scores are used to calculate a Chi-square-based feature weight for each unique subjective term/part-of-speech pair extracted from SentiWordNet 3.0. This weight is then normalized to the range −1 to +1 using min-max normalization. A Senti-CS-based sentiment analysis framework is presented and applied to a large dataset of 50,000 movie reviews. The results are compared with baseline SentiWordNet, Mutual Information, and Information Gain techniques, and a state-of-the-art comparison is performed on the Cornell movie review dataset. The analysis of results indicates that the proposed approach outperforms state-of-the-art classifiers.
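The weighting scheme can be sketched as follows. The 2x2 contingency layout and the +0.5 (Expected Likelihood Estimate style) smoothing are illustrative assumptions about details the abstract does not spell out:

```python
def chi_square_weight(pos, neg, total_pos, total_neg):
    """Signed chi-square association of a term with the positive class,
    with +0.5 smoothing on each cell of the 2x2 contingency table."""
    a = pos + 0.5              # term occurrences in positive documents
    b = neg + 0.5              # term occurrences in negative documents
    c = total_pos - pos + 0.5  # positive documents without the term
    d = total_neg - neg + 0.5  # negative documents without the term
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Sign the statistic by the direction of the association.
    return chi2 if a * d > b * c else -chi2

def min_max_normalize(weights):
    """Rescale raw weights into the [-1, +1] range."""
    lo, hi = min(weights), max(weights)
    return [2 * (w - lo) / (hi - lo) - 1 for w in weights]
```

Signing the chi-square statistic is what turns a pure association measure into a polarity score: positive-leaning terms come out above zero, negative-leaning terms below, and min-max normalization then maps the whole lexicon into the stated [-1, +1] range.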

16.
In this study we review a vast amount of recent literature on machine learning (ML) approaches to predicting financial distress (FD), covering supervised, unsupervised, and hybrid supervised-unsupervised learning algorithms. Four supervised ML models, namely the traditional support vector machine (SVM), the recently developed hybrid associative memory with translation (HACT), hybrid GA-fuzzy clustering, and extreme gradient boosting (XGBoost), were compared in prediction performance with the unsupervised deep belief network (DBN) classifier and the hybrid DBN-SVM model; sixteen financial variables selected from the financial statements of publicly listed Taiwanese firms served as inputs to all six approaches. Our empirical findings, covering the 2010-2016 sample period, demonstrate that XGBoost provided the most accurate FD prediction among the four supervised algorithms, and that the hybrid DBN-SVM model generated more accurate forecasts than either the SVM or the DBN classifier in isolation.

17.

The algorithm selection problem is defined as identifying the best-performing machine learning (ML) algorithm for a given combination of dataset, task, and evaluation measure. The human expertise required to evaluate the increasing number of ML algorithms available has resulted in the need to automate the algorithm selection task. Various approaches have emerged to handle the automatic algorithm selection challenge, including meta-learning. Meta-learning is a popular approach that leverages accumulated experience for future learning and typically involves dataset characterization. Existing meta-learning methods often represent a dataset using predefined features and thus cannot be generalized across different ML tasks, or alternatively, learn a dataset’s representation in a supervised manner and therefore are unable to deal with unsupervised tasks. In this study, we propose a novel learning-based task-agnostic method for producing dataset representations. Then, we introduce TRIO, a meta-learning approach, that utilizes the proposed dataset representations to accurately recommend top-performing algorithms for previously unseen datasets. TRIO first learns graphical representations for the datasets, using four tools to learn the latent interactions among dataset instances and then utilizes a graph convolutional neural network technique to extract embedding representations from the graphs obtained. We extensively evaluate the effectiveness of our approach on 337 datasets and 195 ML algorithms, demonstrating that TRIO significantly outperforms state-of-the-art methods for algorithm selection for both supervised (classification and regression) and unsupervised (clustering) tasks.


18.
Pang, Zhiqi; Guo, Jifeng; Sun, Wenbo; Xiao, Yanbang; Yu, Ming. Applied Intelligence (2022) 52(3): 2987-3001

Although single-domain person re-identification (Re-ID) methods have achieved high accuracy, their dependence on labels from the same image domain severely limits scalability, so cross-domain Re-ID has received increasing attention. In this paper, a novel cross-domain Re-ID method combining supervised and unsupervised learning is proposed, comprising two models: a triple-condition generative adversarial network (TC-GAN) and a dual-task feature extraction network (DFE-Net). We first use TC-GAN to generate labeled images with the target style, and then combine supervised and unsupervised learning to optimize DFE-Net. Specifically, the labeled generated data drive the supervised learning, while effective information mined from the target data from two perspectives drives the unsupervised learning. To combine the two types of learning effectively, we design a dynamic weighting function that adjusts the weights of the two approaches during training. To verify the validity of TC-GAN, DFE-Net, and the dynamic weighting function, we conduct multiple experiments on Market-1501 and DukeMTMC-reID. The experimental results show that the dynamic weighting function improves model performance and that our method outperforms many state-of-the-art methods.
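The abstract does not give the functional form of the dynamic weighting function. A common choice, shown here purely as an illustration, is a linear ramp that shifts emphasis from the supervised loss to the unsupervised loss as training progresses:

```python
def dynamic_weight(epoch, ramp_epochs):
    """Weight of the unsupervised loss: ramps linearly from 0 to 1, then saturates."""
    return min(1.0, epoch / ramp_epochs)

def combined_loss(sup_loss, unsup_loss, epoch, ramp_epochs=30):
    """Convex combination of the supervised and unsupervised objectives."""
    lam = dynamic_weight(epoch, ramp_epochs)
    return (1.0 - lam) * sup_loss + lam * unsup_loss
```

The rationale for ramping is that early in training the model's pseudo-labels on target data are unreliable, so the clean supervised signal from the generated images should dominate first.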


19.
Source recording device recognition is an important emerging research field in digital media forensics. The literature has mainly focused on the source recording device identification problem, whereas few studies address source recording device verification. Sparse-representation-based classification methods have shown promise for many applications. This paper proposes a source cell phone verification scheme based on sparse representation, in three variants that use an exemplar dictionary, an unsupervised learned dictionary, and a supervised learned dictionary, respectively. In particular, the discriminative dictionary learned by the supervised algorithm, which considers representational and discriminative power simultaneously (unlike the unsupervised algorithm), is used to further improve the performance of the sparse-representation-based verification systems. Gaussian supervectors (GSVs) based on MFCCs, which have been shown to capture the intrinsic characteristics of recording devices effectively, are used for constructing and learning the dictionaries. SCUTPHONE, a corpus of speech recordings from 15 cell phones, is presented. Evaluation experiments conducted on three corpora of cell phone speech recordings demonstrate the effectiveness of the proposed methods for cell phone verification. In addition, the influence of the number of target examples in the exemplar dictionary and of the size of the unsupervised learned dictionary on verification performance is analyzed.

20.
In recent years, deep learning algorithms have achieved outstanding results on many supervised learning problems, far surpassing traditional machine learning algorithms in accuracy, efficiency, and degree of automation, and in some cases even exceeding human performance. Researchers' interest is now gradually shifting from supervised learning to reinforcement learning, semi-supervised learning, and unsupervised learning. Video prediction algorithms, which can learn intrinsic video representations from massive amounts of unannotated natural data and have broad application value in robot decision-making, autonomous driving, and video understanding, have developed rapidly over the past two years. This paper reviews the background of video prediction algorithms and the history of deep learning, briefly introduces the prediction of human actions, object motion, and movement trajectories, focuses on the mainstream deep-learning-based video prediction methods and models, and concludes with a summary of open problems and future prospects in the field.
