Similar Documents
20 similar documents retrieved.
1.
Natural Language Processing (NLP) for the Arabic language has gained much significance in recent years. Text classification is one of the most commonly used NLP tasks; its main intention is to apply Machine Learning (ML) approaches to automatically classify textual files into one or more pre-defined categories. In ML approaches, the first and most crucial step is identifying an appropriately large dataset for training and testing the method. Deep Learning (DL), one of the trending ML techniques, needs huge volumes of varied data for training to yield the best outcomes. Against this background, the current study designs a new Dice Optimization with Deep Hybrid Boltzmann Machine-based Arabic Corpus Classification (DODHBM-ACC) model. The presented DODHBM-ACC model primarily relies upon several pre-processing stages and the word2vec word embedding process. For Arabic text classification, the DHBM technique is utilized; it is a hybrid of the Deep Boltzmann Machine (DBM) and the Deep Belief Network (DBN) and has the advantage of learning the decisive intention of the classification process. The Dice Optimization Algorithm (DOA) is exploited in this study to adjust the hyperparameters of the DHBM technique. An experimental analysis was conducted to establish the superior performance of the proposed DODHBM-ACC model, and the outcomes confirm that it performs better than other recent approaches.
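The word2vec embedding stage mentioned above is a standard building block. As a rough illustration only, the gensim sketch below trains token vectors and averages them into document vectors; the toy token list, vector size, and averaging scheme are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of a word2vec embedding step (gensim), assuming a tokenized corpus.
import numpy as np
from gensim.models import Word2Vec

# Hypothetical pre-processed, tokenized documents (placeholders, not real corpus data).
docs = [
    ["kitab", "jadid", "fi", "altarikh"],
    ["akhbar", "riyadia", "alyawm"],
]

# Train skip-gram word2vec vectors on the toy corpus.
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=1, sg=1, epochs=20)

def doc_vector(tokens, model):
    """Average the word vectors of in-vocabulary tokens into one document vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])
print(X.shape)  # (2, 100)
```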

2.
Text classification is a classic research direction in natural language processing and plays an important role in information processing. Deep learning has achieved breakthroughs in image recognition, machine translation, and other fields, and it has also been shown to learn higher-level representations of sentences and documents in natural language processing tasks. This paper proposes a novel hybrid deep learning model, Attention-based C-GRU, for text classification. The model combines the convolutional layer of a CNN with a GRU and introduces an attention mechanism to highlight keywords and optimize the feature extraction process. The model is used to learn text semantics and is evaluated on topic classification, question classification, and sentiment classification tasks. Comparisons with baseline models and state-of-the-art methods demonstrate the effectiveness of the proposed model.
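To make the CNN-plus-GRU-plus-attention combination concrete, here is a minimal PyTorch sketch of that kind of hybrid; the layer sizes and the additive attention formulation are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a CNN + GRU + attention text classifier (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

class AttentionCGRU(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, conv_ch=64, hidden=64, n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Convolution over the token dimension extracts local n-gram features.
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        # GRU models the sequence of convolutional features.
        self.gru = nn.GRU(conv_ch, hidden, batch_first=True, bidirectional=True)
        # Additive attention scores each time step; the weighted sum feeds the classifier.
        self.att = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                                   # x: (batch, seq_len) token ids
        e = self.embed(x).transpose(1, 2)                   # (batch, emb_dim, seq_len)
        c = torch.relu(self.conv(e)).transpose(1, 2)        # (batch, seq_len, conv_ch)
        h, _ = self.gru(c)                                  # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.att(h).squeeze(-1), dim=1)   # (batch, seq_len) attention weights
        ctx = (w.unsqueeze(-1) * h).sum(dim=1)              # attention-weighted summary
        return self.fc(ctx)

logits = AttentionCGRU()(torch.randint(0, 10000, (8, 50)))
print(logits.shape)  # torch.Size([8, 4])
```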

3.
Sentiment analysis (SA) is the procedure of recognizing the emotions expressed in data from social networking. The presence of sarcasm in textual data is a major challenge to the efficiency of SA. Earlier works on sarcasm detection in text used lexical and pragmatic cues, namely interjections, punctuation, and sentiment shifts, which are vital indicators of sarcasm. With the advent of deep learning, recent works leverage neural networks to learn lexical and contextual features, removing the need for handcrafted features. In this respect, this study designs a deep learning with natural language processing enabled SA (DLNLP-SA) technique for sarcasm classification. The proposed DLNLP-SA technique aims to detect and classify the occurrence of sarcasm in the input data. The DLNLP-SA technique comprises various sub-processes, namely preprocessing, feature vector conversion, and classification. Initially, pre-processing is performed in diverse ways such as single-character removal, multi-space removal, URL removal, stopword removal, and tokenization. Secondly, feature vectors are generated using the N-gram feature vector technique. Finally, mayfly optimization (MFO) with a multi-head self-attention based gated recurrent unit (MHSA-GRU) model is employed for the detection and classification of sarcasm. To verify the enhanced outcomes of the DLNLP-SA model, a comprehensive experimental investigation was performed on the News Headlines Dataset from the Kaggle repository, and the results signified its supremacy over the existing approaches.
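The pre-processing and N-gram vectorization steps listed above are generic; a rough sketch with regular expressions and scikit-learn could look like the following (the cleaning rules, example texts, and vectorizer settings are illustrative assumptions, not the paper's exact configuration).

```python
# Minimal sketch of the pre-processing + N-gram vectorization stages (illustrative settings).
import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(text):
    text = re.sub(r"https?://\S+", " ", text)   # URL removal
    text = re.sub(r"\b\w\b", " ", text)         # single-character removal
    text = re.sub(r"\s+", " ", text).strip()    # multi-space removal
    return text.lower()

texts = [
    "Great, another Monday... I just love waking up at 5 am http://example.com",
    "The weather is lovely today",
]
cleaned = [preprocess(t) for t in texts]

# Unigram + bigram counts with English stop-word removal and built-in tokenization.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
X = vectorizer.fit_transform(cleaned)
print(X.shape, len(vectorizer.vocabulary_))
```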

4.
5.
As the popularity of the portable document format (PDF) increases, research that facilitates PDF text analysis and extraction is necessary. Heading detection is a crucial component of PDF-based text classification processes. This research involves training a supervised learning model to detect headings by systematically testing and selecting classifier features using recursive feature elimination. Results indicate that the decision tree is the best classifier, with an accuracy of 95.83%, a sensitivity of 0.981, and a specificity of 0.946. This research into heading detection contributes to the field of PDF-based text extraction and can be applied to the automation of large-scale PDF text analysis in a variety of professional and policy-based contexts.
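As a hedged illustration of the recursive-feature-elimination-plus-decision-tree setup described above, a scikit-learn sketch might look like this; synthetic features stand in for the paper's per-line layout features, and the split and feature counts are assumptions.

```python
# Minimal sketch of recursive feature elimination with a decision tree (synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for per-line layout features (font size, boldness, spacing, ...).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# RFE repeatedly drops the least important features according to the tree's importances.
selector = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=8)
selector.fit(X_tr, y_tr)

clf = DecisionTreeClassifier(random_state=0).fit(selector.transform(X_tr), y_tr)
pred = clf.predict(selector.transform(X_te))
print("accuracy:", accuracy_score(y_te, pred))
print("kept feature mask:", selector.support_)
```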

6.
Machine reading comprehension is an important and challenging task in natural language processing. In recent years, large-scale pre-trained language models such as BERT have achieved remarkable success in this area. However, constrained by the structure and scale of sequence models, BERT-based reading comprehension models show clear deficiencies in building long-range and global semantics, which limits their performance on reading comprehension tasks. To address this problem, this paper proposes a new machine reading comprehension model that fuses sequence and graph structures. First, named entities are extracted from the text, and an entity co-occurrence graph is built using two schemes: sentence-level co-occurrence and sliding-window co-occurrence. A spatial graph convolutional network then learns embeddings of the named entities, and the entity embeddings obtained from the graph structure are fused into the text embeddings obtained from the sequence structure; question answering is finally performed by span extraction. Experimental results show that, compared with a BERT-based sequence-only reading comprehension model, the proposed model that fuses sequence and graph structures improves EM by 7.8% and F1 by 6.6%.
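The sliding-window entity co-occurrence graph can be sketched directly; the following networkx example uses placeholder entities and an assumed window size and count-based edge weighting, not the paper's exact construction.

```python
# Minimal sketch: build a sliding-window co-occurrence graph over named entities (networkx).
from itertools import combinations
import networkx as nx

# Hypothetical sequence of named entities in document order (placeholders).
entities = ["BERT", "Google", "SQuAD", "Stanford", "BERT", "SQuAD"]

def cooccurrence_graph(seq, window=3):
    """Connect entities that appear within the same sliding window; edge weight = count."""
    g = nx.Graph()
    for i in range(len(seq)):
        for a, b in combinations(seq[i:i + window], 2):
            if a == b:
                continue
            w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    return g

g = cooccurrence_graph(entities)
print(g.number_of_nodes(), g.number_of_edges())
print(sorted(g.edges(data="weight")))
```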

7.
As one of the fundamental tasks in natural language processing, text classification provides important support for downstream tasks. With the recent trend of applying deep learning throughout NLP, deep models have performed well on text classification. However, current deep network models are still limited in capturing long-range contextual semantics of text sequences when modeling them, and they do not introduce linguistic information to assist the classifier. To address these problems, this paper proposes a novel English text classification model that combines BERT and Bi-LSTM. The model introduces linguistic information through the BERT pre-trained language model to improve classification accuracy, and uses a Bi-LSTM network to capture bidirectional contextual semantic dependencies for explicit modeling of the text. Specifically, the model consists of an input layer, a BERT pre-trained language model layer, a Bi-LSTM layer, and a classifier layer. Experimental results show that, compared with existing classification models, the proposed Bert-Bi-LSTM model achieves the highest classification accuracy on the MR, SST-2, and CoLA datasets, reaching 86.2%, 91.5%, and 83.2% respectively, substantially improving the performance of English text classification.
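To make the input-BERT-Bi-LSTM-classifier stack concrete, here is a minimal Hugging Face transformers + PyTorch sketch; the hidden size and the mean-pooling step are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a BERT + Bi-LSTM classifier (Hugging Face transformers + PyTorch).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertBiLSTM(nn.Module):
    def __init__(self, model_name="bert-base-uncased", hidden=128, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # Bi-LSTM over BERT's token representations captures bidirectional dependencies.
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        h, _ = self.lstm(out.last_hidden_state)          # (batch, seq_len, 2*hidden)
        # Mean-pool over non-padding tokens before classification.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.fc(pooled)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a gripping, well-acted film", "flat and lifeless"],
                  padding=True, truncation=True, return_tensors="pt")
model = BertBiLSTM()
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([2, 2])
```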

8.
In current natural language processing research, short-text classification with convolutional neural networks (CNNs) can combine different network structures and classification algorithms to improve performance. Accordingly, this paper proposes a CNN-ELM hybrid short-text classification model that combines a convolutional neural network with an extreme learning machine. Word vectors are trained and assembled into a text matrix as the input data; a CNN extracts features, which are then refined by a Highway network; finally, an error-minimized extreme learning machine (EM-ELM) serves as the classifier to complete the short-text classification task. Compared with other models, this hybrid model extracts more representative features and produces classification results quickly and accurately. Experimental results on several English datasets show that the proposed CNN-ELM hybrid model is better suited to short-text classification than traditional machine learning models and other deep learning models.
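The extreme learning machine at the end of the pipeline is simple enough to sketch directly. The NumPy version below is a generic ELM (random hidden layer, least-squares output weights), not the paper's error-minimized EM-ELM variant, and the random input vectors merely stand in for the CNN-extracted text features.

```python
# Minimal NumPy sketch of a basic extreme learning machine classifier (not the EM-ELM variant).
import numpy as np

class ELMClassifier:
    def __init__(self, n_hidden=256, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Random, fixed input weights and biases; sigmoid hidden activations.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        T = np.eye(n_classes)[y]                       # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights via least squares (Moore-Penrose pseudoinverse).
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# Toy usage with random vectors standing in for the CNN/Highway text features.
Xtr, ytr = np.random.randn(200, 64), np.random.randint(0, 2, 200)
print(ELMClassifier().fit(Xtr, ytr).predict(np.random.randn(5, 64)))
```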

9.
With the continuous development of natural language processing (NLP), deep learning has gradually been applied to text classification. However, most algorithms do not make effective use of the instance information in the training text, leading to incomplete feature extraction. To exploit instance information effectively, this paper proposes a text classification model combining nearest-neighbor attention and a convolutional neural network (CNN-AKNN). An improved nearest-neighbor algorithm based on weighted chi-square distance is introduced over the training text to build attention for text objects; the attention mechanism is then combined with a convolutional neural network to extract both global and local features, and classification is finally performed with a softmax function. Extensive experiments on the Sogou news corpus, the Sun Yat-sen University corpus, and the English news corpus AG_news show that the improved algorithm outperforms the baseline algorithms and better captures the model's latent features.

10.
In recent years, huge volumes of healthcare data have been generated in various forms. The advancements made in medical imaging are tremendous, owing to which biomedical image acquisition has become easier and quicker. Due to such massive generation of big data, new methods based on Big Data Analytics (BDA), Machine Learning (ML), and Artificial Intelligence (AI) have become essential. In this respect, the current research work develops a new Big Data Analytics with Cat Swarm Optimization based Deep Learning (BDA-CSODL) technique for medical image classification in an Apache Spark environment. The aim of the proposed BDA-CSODL technique is to classify medical images and diagnose disease accurately. The BDA-CSODL technique involves different stages of operation such as preprocessing, segmentation, feature extraction, and classification. It also follows a multi-level thresholding-based image segmentation approach for the detection of infected regions in medical images. Moreover, a deep convolutional neural network, Inception v3, is utilized in this study as the feature extractor, and the Stochastic Gradient Descent (SGD) model is used for parameter tuning. Furthermore, a Cat Swarm Optimization with Long Short-Term Memory (CSO-LSTM) model is employed as the classifier to assign the appropriate class labels. Both the SGD and CSO design approaches help improve the overall image classification performance of the proposed BDA-CSODL technique. A wide range of simulations was conducted on benchmark medical image datasets, and the comprehensive comparative results demonstrate the supremacy of the proposed BDA-CSODL technique under different measures.
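Only the Inception v3 feature-extraction step lends itself to a short sketch; the torchvision example below strips the classification head to expose pooled features for a downstream classifier, with random tensors standing in for pre-processed medical images (an assumption, not the paper's data pipeline).

```python
# Minimal sketch: Inception v3 as a feature extractor (torchvision), head replaced by Identity.
import torch
from torchvision import models

# Load Inception v3 and strip the classification head to expose 2048-d pooled features.
net = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
net.fc = torch.nn.Identity()
net.eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 299, 299)   # placeholder for pre-processed 299x299 images
    features = net(batch)                 # (4, 2048) feature vectors for a downstream classifier
print(features.shape)
```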

11.
Sentiment analysis (SA) research has increased tremendously in recent times. SA aims to determine the sentiment orientation of a given text as positive or negative. The motivation for SA research is the industry's need to know users' opinions about their products from online portals, blogs, discussion boards, reviews, and so on. Efficient features need to be extracted for the machine-learning algorithm to achieve better sentiment classification. In this paper, various features are initially extracted from the text, such as unigrams, bi-grams, and dependency features. In addition, new bi-tagged features that conform to predefined part-of-speech patterns are extracted, and various composite features are created from these features. Information gain (IG) and minimum redundancy maximum relevancy (mRMR) feature selection methods are used to eliminate noisy and irrelevant features from the feature vector. Finally, machine-learning algorithms are used to classify each review document into the positive or negative class. The effects of different categories of features are investigated on four standard datasets, namely the movie review and product (book, DVD, and electronics) review datasets. Experimental results show that composite features created from prominent unigram and bi-tagged features perform better than other features for sentiment classification, and that mRMR is a better feature selection method than IG for this task. The Boolean Multinomial Naïve Bayes algorithm performs better than the support vector machine classifier for SA in terms of accuracy and execution time.
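As an illustration of this kind of n-gram feature extraction, feature selection, and Naive Bayes pipeline, here is a scikit-learn sketch; mutual information stands in for information gain, mRMR and the bi-tagged/dependency features are omitted, and the four-review corpus and k value are placeholders.

```python
# Minimal sketch: boolean n-gram features + mutual-information selection + Multinomial NB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

reviews = ["a wonderful, moving film with great performances",
           "dull plot and terrible acting, a waste of time",
           "loved every minute of it",
           "boring and predictable"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

pipeline = Pipeline([
    # Boolean presence features over unigrams and bigrams.
    ("ngrams", CountVectorizer(ngram_range=(1, 2), binary=True)),
    # Keep the k features with the highest mutual information with the label
    # (a stand-in for the paper's information-gain / mRMR selection).
    ("select", SelectKBest(mutual_info_classif, k=20)),
    ("nb", MultinomialNB()),
])
pipeline.fit(reviews, labels)
print(pipeline.predict(["great acting and a moving story"]))
```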

12.
With the rapid growth of online shopping, merchants and shoppers generate large volumes of transaction data in online trading, which holds great analytical value. To address the classification of social e-commerce product texts and determine the category of the described product more efficiently and accurately, this paper proposes a social e-commerce text classification algorithm based on the BERT model. The algorithm first uses the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model to obtain sentence-level feature vector representations of social e-commerce text, then feeds the resulting feature vectors into a classifier, and finally validates the algorithm on a social e-commerce text dataset. Experimental results show that the trained model achieves an F1 score of up to 94.61% on the test set, 6% higher than the BERT model achieves on the MRPC classification task. The proposed social e-commerce text classification algorithm can therefore determine the product category described by a text efficiently and accurately, helping to further analyze online transaction data and extract valuable information from massive data.

13.
Event extraction is one of the key tasks in building knowledge graphs and a current research hotspot and difficult problem in natural language processing. Event extraction research aims to automatically extract event information of interest from unstructured natural language text, which has far-reaching significance for understanding the world and important value in application scenarios such as information retrieval, intelligent question answering, and sentiment analysis. Driven by public international evaluations and corpora, event extraction has attracted increasing attention from researchers and produced many results. By task definition, there is frame-based event extraction, which uses a predefined structured event representation framework, and instance-based event extraction, which clusters triggers and event arguments from event instances. By method, approaches can be divided into pattern-matching-based and machine-learning-based methods; Chinese event extraction must additionally account for the characteristics of the Chinese language. This paper comprehensively reviews the tasks and methods of Chinese event extraction and summarizes and discusses future development trends.

14.
The term 'corpus' refers to a huge volume of structured, machine-readable texts generated in a natural communicative setting. The explosion of social media has allowed individuals to spread information freely, with minimal examination and filtering. Because of this, the old problem of fake news has resurfaced and become an important concern due to its negative impact on the community. To manage the spread of fake news, automatic recognition approaches have been investigated earlier using Artificial Intelligence (AI) and Machine Learning (ML) techniques. ML approaches have been applied to medicinal text classification tasks and performed quite effectively, yet a huge human effort is still required to generate labelled training data. The recent progress of Deep Learning (DL) methods seems to be a promising solution for difficult Natural Language Processing (NLP) tasks, especially fake news detection, and an automatic text classifier is highly helpful for unlocking social media data. The current research article focuses on the design of an Optimal Quad Channel Hybrid Long Short-Term Memory-based Fake News Classification (QCLSTM-FNC) approach. The presented QCLSTM-FNC approach aims to identify and differentiate fake news from actual news. To attain this, the proposed QCLSTM-FNC approach follows two methods, a data pre-processing method and a GloVe-based word embedding process, and the QCLSTM model is utilized for classification. To boost the classification results of the QCLSTM model, a Quasi-Oppositional Sandpiper Optimization (QOSPO) algorithm is utilized to fine-tune the hyperparameters. The proposed QCLSTM-FNC approach was experimentally validated against a benchmark dataset and successfully outperformed all other existing DL models under different measures.
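The GloVe word embedding step is a standard ingredient; the sketch below shows one common way to load pre-trained GloVe vectors into an embedding matrix. The file name, vocabulary, and dimension are assumptions (the standard Stanford "glove.6B.100d.txt" file is used as an example path), not details from the paper.

```python
# Minimal sketch: load pre-trained GloVe vectors into an embedding matrix (file path assumed).
import numpy as np

def load_glove(path, vocab, dim=100):
    """Build an embedding matrix for `vocab` from a GloVe text file; OOV words stay zero."""
    matrix = np.zeros((len(vocab), dim), dtype="float32")
    index = {w: i for i, w in enumerate(vocab)}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], parts[1:]
            if word in index and len(vec) == dim:
                matrix[index[word]] = np.asarray(vec, dtype="float32")
    return matrix

vocab = ["the", "breaking", "news", "report", "<pad>"]
# Adjust the path to wherever the GloVe file is stored locally.
emb = load_glove("glove.6B.100d.txt", vocab, dim=100)
print(emb.shape)  # (5, 100)
```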

15.
Electronic health records (EHRs), digital collections of patient healthcare events and observations, are ubiquitous in medicine and critical to healthcare delivery, operations, and research. Despite this central role, EHRs are notoriously difficult to process automatically. Well over half of the information stored within EHRs is in the form of unstructured text (e.g., provider notes, operation reports) and remains largely untapped for secondary use. Recently, however, newer neural network and deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. In this survey paper, we summarize current neural NLP methods for EHR applications. We focus on a broad scope of tasks, namely classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, and interpretability.

16.
Fake news affects diverse aspects of diverse entities, ranging from a city's lifestyle to a country's global relations, and various methods are available to collect and identify it. Recently developed machine learning (ML) models can be employed for the detection and classification of fake news. This study designs a novel Chaotic Ant Swarm with Weighted Extreme Learning Machine (CAS-WELM) for cybersecurity fake news detection and classification. The goal of the CAS-WELM technique is to discriminate news into fake and real. The CAS-WELM technique initially pre-processes the input data and uses the GloVe technique for the word embedding process. Then, an N-gram based feature extraction technique is derived to generate feature vectors. Lastly, the WELM model is applied for the detection and classification of fake news, with the weight values of the WELM model optimally adjusted by the CAS algorithm. The performance of the CAS-WELM technique was validated using a benchmark dataset, and the results were inspected under several dimensions. The experimental results reported the enhanced outcomes of the CAS-WELM technique over recent approaches.

17.
Event extraction is an important and challenging task in natural language processing (NLP); it identifies event triggers in text and the arguments associated with each trigger. For the multi-event extraction task, where a single sentence contains several events, this paper proposes a variant of the attention mechanism called the dynamic masked attention network (DyMAN). Compared with conventional attention, dynamic masked attention captures richer contextual representations and retains more valuable information. In experiments on the ACE 2005 dataset, the DyMAN model improves trigger classification by 9.8% and argument classification by 4.5% over the previous best model, JRNN, showing that DyMAN-based event extraction achieves leading results on multi-event extraction.
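For readers unfamiliar with masking in attention, here is a generic masked additive attention sketch in PyTorch; it illustrates how masked positions are excluded from the softmax, and is not the dynamic-masking scheme proposed as DyMAN.

```python
# Generic masked additive attention in PyTorch (illustrative; not the DyMAN dynamic masking).
import torch
import torch.nn as nn

class MaskedAdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h, mask):
        # h: (batch, seq_len, dim); mask: (batch, seq_len) with 1 = keep, 0 = mask out.
        scores = self.score(h).squeeze(-1)                     # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # masked positions get no weight
        weights = torch.softmax(scores, dim=1)
        return (weights.unsqueeze(-1) * h).sum(dim=1)          # (batch, dim) context vector

h = torch.randn(2, 6, 32)
mask = torch.tensor([[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]])
print(MaskedAdditiveAttention(32)(h, mask).shape)  # torch.Size([2, 32])
```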

18.
Short-text classification is a research hotspot in natural language processing. To improve classification accuracy and address the sparsity of text representations, this paper proposes a new text representation method (N-of-DOC). Word2Vec is used to obtain a distributed representation of each phrase, and the resulting vectors are used as the input to a convolutional neural network; convolution and pooling layers extract high-level features, and a classifier attached to the output layer produces the classification result. Experimental results show that, compared with traditional machine learning methods (k-nearest neighbors, support vector machine, logistic regression, and naive Bayes), the proposed method not only mitigates the curse of dimensionality and the sparsity of Chinese text vectors but also improves classification accuracy by 4.23% over the traditional methods.

19.
To address feature extraction and classification in text retrieval, this paper proposes a feature selection and learning-to-rank method based on an embedded-space support vector machine. Unlike the combination methods commonly used for multi-class feature selection, the proposed method converts an ordinal classification problem into a binary classification problem and selects the most effective features globally. Compared with the existing Ranking SVM, the number of training samples grows only linearly during the conversion, which greatly improves retrieval speed. Experimental results on synthetic datasets and standard text classification datasets show that the proposed method handles feature selection and ranking in text retrieval well.

20.
Lata, Kusum; Singh, Pardeep; Dutta, Kamlesh. Applied Intelligence, 2022, 52(9): 9816-9860.

Coreference resolution is an essential task for Natural Language Processing (NLP) applications and has a paramount impact on the performance of text summarization, machine translation, text classification, and recognizing textual entailment. Mention Detection (MD) is the core component of the coreference resolution task; it is the process of extracting all possible mentions from the text. A mention is a textual representation of an entity, such as a name, nominal, or pronominal mention; mentions appear in the text in different forms while referring to the same entity. The performance of an MD module positively affects the performance of NLP tasks such as coreference resolution, relation extraction, information retrieval, and information extraction, while incorrect identification of mentions severely reduces the efficiency of coreference resolution. This paper provides a comprehensive overview of the state of the art in mention detection approaches used in coreference resolution and explains the importance of MD for the task. The existing approaches are classified by their underlying techniques into three categories: rule-based, statistics-based, and deep-learning-based mention detection, with deep learning improving as more data and more powerful computing resources become available. This study offers a comparative analysis of the various mention detection approaches and helps researchers assimilate knowledge about them from several aspects.

