首页 | 官方网站   微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 421 毫秒
1.
Twitter has become an important data source for detecting events, especially tracking detailed information for events of a specific domain. Previous studies on targeted-domain Twitter information extraction have used supervised learning techniques to identify domain-related tweets, however, the need for extensive manual labeling makes these supervised systems extremely expensive to build and maintain. What’s more, most of these existing work fail to consider spatiotemporal factors, which are essential attributes of target-domain events. In this paper, we propose a semi-supervised method for Automatical Targeted-domain Spatiotemporal Event Detection (ATSED) in Twitter. Given a targeted domain, ATSED first learns tweet labels from historical data, and then detects on-going events from real-time Twitter data streams. Specifically, an efficient label generation algorithm is proposed to automatically recognize tweet labels from domain-related news articles, a customized classifier is created for Twitter data analysis by utilizing tweets’ distinguishing features, and a novel multinomial spatial-scan model is provided to identify geographical locations for detected events. Experiments on 305 million tweets demonstrated the effectiveness of this new approach.  相似文献   

2.
基于集成学习的自训练算法是一种半监督算法,不少学者通过集成分类器类别投票或平均置信度的方法选择可靠样本。基于置信度的投票策略倾向选择置信度高的样本或置信度低但投票却一致的样本进行标记,后者这种情形可能会误标记靠近决策边界的样本,而采用异构集成分类器也可能会导致各基分类器对高置信度样本的类别标记不同,从而无法将其有效加入到有标记样本集。提出了结合主动学习与置信度投票策略的集成自训练算法用来解决上述问题。该算法合理调整了投票策略,选择置信度高且投票一致的无标记样本加以标注,同时利用主动学习对投票不一致而置信度较低的样本进行人工标注,以弥补集成自训练学习只关注置信度高的样本,而忽略了置信度低的样本的有用信息的缺陷。在UCI数据集上的对比实验验证了该算法的有效性。  相似文献   

3.
Twitter and Reddit are two of the most popular social media sites used today. In this paper, we study the use of machine learning and WordNet-based classifiers to generate an interest profile from a user’s tweets and use this to recommend loosely related Reddit threads which the reader is most likely to be interested in. We introduce a genre classification algorithm using a similarity measure derived from WordNet lexical database for English to label genres for nouns in tweets. The proposed algorithm generates a user’s interest profile from their tweets based on a referencing taxonomy of genres derived from the genre-tagged Brown Corpus augmented with a technology genre. The top K genres of a user’s interest profile can be used for recommending subreddit articles in those genres. Experiments using real life test cases collected from Twitter have been done to compare the performance on genre classification by using the WordNet classifier and machine learning classifiers such as SVM, Random Forests, and an ensemble of Bayesian classifiers. Empirically, we have obtained similar results from the two different approaches with a sufficient number of tweets. It seems that machine learning algorithms as well as the WordNet ontology are viable tools for developing recommendation engine based on genre classification. One advantage of the WordNet approach is simplicity and no learning is required. However, the WordNet classifier tends to have poor precision on users with very few tweets.  相似文献   

4.
主动协同半监督粗糙集分类模型   总被引:1,自引:0,他引:1  
粗糙集理论是一种有监督学习模型,一般需要适量有标记的数据来训练分类器。但现实一些问题往往存在大量无标记的数据,而有标记数据由于标记代价过大较为稀少。文中结合主动学习和协同训练理论,提出一种可有效利用无标记数据提升分类性能的半监督粗糙集模型。该模型利用半监督属性约简算法提取两个差异性较大的约简构造基分类器,然后基于主动学习思想在无标记数据中选择两分类器分歧较大的样本进行人工标注,并将更新后的分类器交互协同学习。UCI数据集实验对比分析表明,该模型能明显提高分类学习性能,甚至能达到数据集的最优值。  相似文献   

5.
Twitter spam detection is a recent area of research in which most previous works had focused on the identification of malicious user accounts and honeypot-based approaches. However, in this paper we present a methodology based on two new aspects: the detection of spam tweets in isolation and without previous information of the user; and the application of a statistical analysis of language to detect spam in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are in everybody’s lips. This growing microblogging phenomenon therefore allows spammers to disseminate malicious tweets quickly and massively. In this paper we present the first work that tries to detect spam tweets in real time using language as the primary tool. We first collected and labeled a large dataset with 34 K trending topics and 20 million tweets. Then, we have proposed a reduced set of features hardly manipulated by spammers. In addition, we have developed a machine learning system with some orthogonal features that can be combined with other sets of features with the aim of analyzing emergent characteristics of spam in social networks. We have also conducted an extensive evaluation process that has allowed us to show how our system is able to obtain an F-measure at the same level as the best state-of-the-art systems based on the detection of spam accounts. Thus, our system can be applied to Twitter spam detection in trending topics in real time due mainly to the analysis of tweets instead of user accounts.  相似文献   

6.
基于单类分类器的半监督学习   总被引:1,自引:0,他引:1  
提出一种结合单类学习器和集成学习优点的Ensemble one-class半监督学习算法.该算法首先为少量有标识数据中的两类数据分别建立两个单类分类器.然后用建立好的两个单类分类器共同对无标识样本进行识别,利用已识别的无标识样本对已建立的两个分类面进行调整、优化.最终被识别出来的无标识数据和有标识数据集合在一起训练一个基分类器,多个基分类器集成在一起对测试样本的测试结果进行投票.在5个UCI数据集上进行实验表明,该算法与tri-training算法相比平均识别精度提高4.5%,与仅采用纯有标识数据的单类分类器相比,平均识别精度提高8.9%.从实验结果可以看出,该算法在解决半监督问题上是有效的.  相似文献   

7.
Abstract

With the use of identity resolution, both information leakage and identity hacking can be reduced to some extent. In this paper, a prototype has been developed to classify Twitter users as suspicious and nonsuspicious on the basis of features which identify user demographics and their tweeting activity using Twitter APIs. A model has been devised based upon user and tweet meta-data which is used to calculate user score and tweet score, and further aggregate the values generated by these scores to label suspicious and nonsuspicious users in the collected dataset of around 21,492 Twitter users. Further, support vector machine classifier has been used to classify the labeled data. Through this paper, our analysis about the role of features and the characteristics of dataset used for the categorization of users in Twitter has been reported. The experimental results illustrate that the proposed system can identify suspicious users with an accuracy of 94.1%.  相似文献   

8.

Social media platform like Twitter is one of the primary sources for sharing real-time information at the time of events such as disasters, political events, etc. Detecting the resource tweets during a disaster is an essential task because tweets contain different types of information such as infrastructure damage, resources, opinions and sympathies of disaster events, etc. Tweets are posted related to Need and Availability of Resources (NAR) by humanitarian organizations and victims. Hence, reliable methodologies are required for detecting the NAR tweets during a disaster. The existing works don’t focus well on NAR tweets detection and also had poor performance. Hence, this paper focus on detection of NAR tweets during a disaster. Existing works often use features and appropriate machine learning algorithms on several Natural Language Processing (NLP) tasks. Recently, there is a wide use of Convolutional Neural Networks (CNN) in text classification problems. However, it requires a large amount of manual labeled data. There is no such large labeled data is available for NAR tweets during a disaster. To overcome this problem, stacking of Convolutional Neural Networks with traditional feature based classifiers is proposed for detecting the NAR tweets. In our approach, we propose several informative features such as aid, need, food, packets, earthquake, etc. are used in the classifier and CNN. The learned features (output of CNN and classifier with informative features) are utilized in another classifier (meta-classifier) for detection of NAR tweets. The classifiers such as SVM, KNN, Decision tree, and Naive Bayes are used in the proposed model. From the experiments, we found that the usage of KNN (base classifier) and SVM (meta classifier) with the combination of CNN in the proposed model outperform the other algorithms. This paper uses 2015 and 2016 Nepal and Italy earthquake datasets for experimentation. The experimental results proved that the proposed model achieves the best accuracy compared to baseline methods.

  相似文献   

9.
Combining machine learning with social network analysis (SNA) can leverage vast amounts of social media data to better respond to crises. We present a case study using Twitter data from the March 2019 Nebraska floods in the United States, which caused over $1 billion in damage in the state and widespread evacuations of residents. We use a subset of machine learning, deep learning (DL), to classify text content of 11,982 tweets, and we integrate that with SNA to understand the structure of tweet interactions. Our DL approach pre‐trains our model with a DL language technique, BERT, and then trains the model using the standard training dataset to sort a dataset of tweets into classes tailored to crisis events. Several performance measures demonstrate that our two‐tiered trained model improves domain adaptation and generalization across different extreme weather event types. This approach identifies the role of Twitter during the damage containment stage of the flood. Our SNA identifies accounts that function as primary sources of information on Twitter. Together, these two approaches help crisis managers filter large volumes of data and overcome challenges faced by simple statistical models and other computational techniques to provide useful information during crises like flooding.  相似文献   

10.
Twitter is a radiant platform with a quick and effective technique to analyze users’ perceptions of activities on social media. Many researchers and industry experts show their attention to Twitter sentiment analysis to recognize the stakeholder group. The sentiment analysis needs an advanced level of approaches including adoption to encompass data sentiment analysis and various machine learning tools. An assessment of sentiment analysis in multiple fields that affect their elevations among the people in real-time by using Naive Bayes and Support Vector Machine (SVM). This paper focused on analysing the distinguished sentiment techniques in tweets behaviour datasets for various spheres such as healthcare, behaviour estimation, etc. In addition, the results in this work explore and validate the statistical machine learning classifiers that provide the accuracy percentages attained in terms of positive, negative and neutral tweets. In this work, we obligated Twitter Application Programming Interface (API) account and programmed in python for sentiment analysis approach for the computational measure of user’s perceptions that extract a massive number of tweets and provide market value to the Twitter account proprietor. To distinguish the results in terms of the performance evaluation, an error analysis investigates the features of various stakeholders comprising social media analytics researchers, Natural Language Processing (NLP) developers, engineering managers and experts involved to have a decision-making approach.  相似文献   

11.
基于Tri-training的半监督SVM   总被引:1,自引:1,他引:0       下载免费PDF全文
当前机器学习面临的主要问题之一是如何有效地处理海量数据,而标记训练数据是十分有限且不易获得的。提出了一种新的半监督SVM算法,该算法在对SVM训练中,只要求少量的标记数据,并能利用大量的未标记数据对分类器反复的修正。在实验中发现,Tri-training的应用确实能够提高SVM算法的分类精度,并且通过增大分类器间的差异性能够获得更好的分类效果,所以Tri-training对分类器的要求十分宽松,通过SVM的不同核函数来体现分类器之间的差异性,进一步改善了协同训练的性能。理论分析与实验表明,该算法具有较好的学习效果。  相似文献   

12.
A major problem in monitoring the online reputation of companies, brands, and other entities is that entity names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging services such as Twitter, where texts are very short and there is little context to disambiguate. In this paper we address the filtering task of determining, out of a set of tweets that contain a company name, which ones do refer to the company. Our approach relies on the identification of filter keywords: those whose presence in a tweet reliably confirm (positive keywords) or discard (negative keywords) that the tweet refers to the company.We describe an algorithm to extract filter keywords that does not use any previously annotated data about the target company. The algorithm allows to classify 58% of the tweets with 75% accuracy; and those can be used to feed a machine learning algorithm to obtain a complete classification of all tweets with an overall accuracy of 73%. In comparison, a 10-fold validation of the same machine learning algorithm provides an accuracy of 85%, i.e., our unsupervised algorithm has a 14% loss with respect to its supervised counterpart.Our study also shows that (i) filter keywords for Twitter does not directly derive from the public information about the company in the Web: a manual selection of keywords from relevant web sources only covers 15% of the tweets with 86% accuracy; (ii) filter keywords can indeed be a productive way of classifying tweets: the five best possible keywords cover, in average, 28% of the tweets for a company in our test collection.  相似文献   

13.
Visual data classification using insufficient labeled data is a well-known hard problem. Semi-supervise learning, which attempts to exploit the unlabeled data in additional to the labeled ones, has attracted much attention in recent years. This paper proposes a novel semi-supervised classifier called discriminative deep belief networks (DDBN). DDBN utilizes a new deep architecture to integrate the abstraction ability of deep belief nets (DBN) and discriminative ability of backpropagation strategy. For unsupervised learning, DDBN inherits the advantage of DBN, which preserves the information well from high-dimensional features space to low-dimensional embedding. For supervised learning, through a well designed objective function, the backpropagation strategy directly optimizes the classification results in training dataset by refining the parameter space. Moreover, we apply DDBN to visual data classification task and observe an important fact that the learning ability of deep architecture is seriously underrated in real-world applications, especially in visual data analysis. The comparative experiments on standard datasets of different types and different scales demonstrate that the proposed algorithm outperforms both representative semi-supervised classifiers and existing deep learning techniques. For visual dataset, we can further improve the DDBN performance with much larger and deeper architecture.  相似文献   

14.
Web spam是指通过内容作弊和网页间链接作弊来欺骗搜索引擎,从而提升自身搜索排名的作弊网页,它干扰了搜索结果的准确性和相关性。提出基于Co-Training模型的Web spam检测方法,使用了网页的两组相互独立的特征——基于内容的统计特征和基于网络图的链接特征,分别建立两个独立的基本分类器;使用Co-Training半监督式学习算法,借助大量未标记数据来改善分类器质量。在WEB SPAM-UK2007数据集上的实验证明:算法改善了SVM分类器的效果。  相似文献   

15.
Sentiment analysis focuses on identifying and classifying the sentiments expressed in text messages and reviews. Social networks like Twitter, Facebook, and Instagram generate heaps of data filled with sentiments, and the analysis of such data is very fruitful when trying to improve the quality of both products and services alike. Classic machine learning techniques have a limited capability to efficiently analyze such large amounts of data and produce precise results; they are thus supported by deep learning models to achieve higher accuracy. This study proposes a combination of convolutional neural network and long short‐term memory (CNN‐LSTM) deep network for performing sentiment analysis on Twitter datasets. The performance of the proposed model is analyzed with machine learning classifiers, including the support vector classifier, random forest (RF), stochastic gradient descent (SGD), logistic regression, a voting classifier (VC) of RF and SGD, and state‐of‐the‐art classifier models. Furthermore, two feature extraction methods (term frequency‐inverse document frequency and word2vec) are also investigated to determine their impact on prediction accuracy. Three datasets (US airline sentiments, women's e‐commerce clothing reviews, and hate speech) are utilized to evaluate the performance of the proposed model. Experiment results demonstrate that the CNN‐LSTM achieves higher accuracy than those of other classifiers.  相似文献   

16.
网络作弊检测是搜索引擎的重要挑战之一,该文提出基于遗传规划的集成学习方法 (简记为GPENL)来检测网络作弊。该方法首先通过欠抽样技术从原训练集中抽样得到t个不同的训练集;然后使用c个不同的分类算法对t个训练集进行训练得到t*c个基分类器;最后利用遗传规划得到t*c个基分类器的集成方式。新方法不仅将欠抽样技术和集成学习融合起来提高非平衡数据集的分类性能,还能方便地集成不同类型的基分类器。在WEBSPAM-UK2006数据集上所做的实验表明无论是同态集成还是异态集成,GPENL均能提高分类的性能,且异态集成比同态集成更加有效;GPENL比AdaBoost、Bagging、RandomForest、多数投票集成、EDKC算法和基于Prediction Spamicity的方法取得更高的F-度量值。  相似文献   

17.
Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multi-dimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 (determined by power analysis) randomly selected datasets from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.1  相似文献   

18.
Web-blogging sites such as Twitter and Facebook are heavily influenced by emotions, sentiments, and data in the modern era. Twitter, a widely used microblogging site where individuals share their thoughts in the form of tweets, has become a major source for sentiment analysis. In recent years, there has been a significant increase in demand for sentiment analysis to identify and classify opinions or expressions in text or tweets. Opinions or expressions of people about a particular topic, situation, person, or product can be identified from sentences and divided into three categories: positive for good, negative for bad, and neutral for mixed or confusing opinions. The process of analyzing changes in sentiment and the combination of these categories is known as “sentiment analysis.” In this study, sentiment analysis was performed on a dataset of 90,000 tweets using both deep learning and machine learning methods. The deep learning-based model long-short-term memory (LSTM) performed better than machine learning approaches. Long short-term memory achieved 87% accuracy, and the support vector machine (SVM) classifier achieved slightly worse results than LSTM at 86%. The study also tested binary classes of positive and negative, where LSTM and SVM both achieved 90% accuracy.  相似文献   

19.
检测恶意URL对防御网络攻击有着重要意义. 针对有监督学习需要大量有标签样本这一问题, 本文采用半监督学习方式训练恶意URL检测模型, 减少了为数据打标签带来的成本开销. 在传统半监督学习协同训练(co-training)的基础上进行了算法改进, 利用专家知识与Doc2Vec两种方法预处理的数据训练两个分类器, 筛选两个分类器预测结果相同且置信度高的数据打上伪标签(pseudo-labeled)后用于分类器继续学习. 实验结果表明, 本文方法只用0.67%的有标签数据即可训练出检测精确度(precision)分别达到99.42%和95.23%的两个不同类型分类器, 与有监督学习性能相近, 比自训练与协同训练表现更优异.  相似文献   

20.
当前已有的数据流分类模型都需要大量已标记样本来进行训练,但在实际应用中,对大量样本标记的成本相对较高。针对此问题,提出了一种基于半监督学习的数据流混合集成分类算法SMEClass,选用混合模式来组织基础分类器,用K个决策树分类器投票表决为未标记数据添加标记,以提高数据类标的置信度,增强集成分类器的准确度,同时加入一个贝叶斯分类器来有效减少标记过程中产生的噪音数据。实验结果显示,SMEClass算法与最新基于半监督学习的集成分类算法相比,其准确率有所提高,在运行时间和抗噪能力方面有明显优势。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号