Similar literature
1.

Twitter has become a popular microblogging and social media platform for news and discussion. The dramatic growth of the platform has been accompanied by an equally dramatic growth in spam. Supervised machine learning requires a labeled Twitter dataset, and preparing one takes a great deal of human effort, so a semi-supervised technique for labeling newly collected datasets is desirable. This motivated us to propose an efficient approach for preparing labeled datasets that saves time and avoids human error. The approach relies on features that are readily available in real time, for better performance and wider applicability. This work collects the most recent tweets of a user through Twitter streaming to prepare an up-to-date Twitter dataset. A semi-supervised machine learning algorithm based on the self-training technique was then designed for labeling the tweets, with a semi-supervised support vector machine and a semi-supervised decision tree as base classifiers. The authors also applied the K-means clustering algorithm to the tweets based on tweet content. The proposed approach is an ensemble of semi-supervised and unsupervised learning, in which the semi-supervised algorithms were found to predict more accurately than the unsupervised one. To assign labels to the tweets, the authors use voting: the label predicted by the majority voting classifier is the label assigned to the tweet. The paper reports a maximum accuracy of 99.0% for spam labeling with the majority voting classifier.
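The voting step described in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' implementation, so the function name `majority_vote`, the label strings, and the tie-breaking rule below are assumptions made only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the per-item majority label across several labelers.

    predictions: list of equal-length label lists, one list per labeler
    (for example an SVM, a decision tree, and a K-means cluster mapping).
    Ties go to the label that appears first for that item (assumption).
    """
    if not predictions:
        raise ValueError("need at least one labeler")
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all labelers must label the same items")
    voted = []
    # zip(*predictions) groups the labels that each labeler gave to one item.
    for item_labels in zip(*predictions):
        voted.append(Counter(item_labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three labelers for four tweets.
svm_labels = ["spam", "ham", "spam", "ham"]
tree_labels = ["spam", "spam", "ham", "ham"]
kmeans_labels = ["ham", "spam", "spam", "ham"]
print(majority_vote([svm_labels, tree_labels, kmeans_labels]))
# -> ['spam', 'spam', 'spam', 'ham']
```

With an odd number of labelers and two classes, ties cannot occur, which is one reason a three-member ensemble is a convenient choice for binary spam labeling.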


2.
Twitter spam detection is a recent area of research in which most previous work has focused on identifying malicious user accounts and on honeypot-based approaches. In this paper we present a methodology based on two new aspects: the detection of spam tweets in isolation, without prior information about the user; and the application of statistical language analysis to detect spam in trending topics. Trending topics capture the emerging Internet trends and topics of discussion that are on everybody’s lips. This growing microblogging phenomenon therefore allows spammers to disseminate malicious tweets quickly and massively. We present the first work that tries to detect spam tweets in real time using language as the primary tool. We first collected and labeled a large dataset with 34 K trending topics and 20 million tweets. We then proposed a reduced set of features that are hard for spammers to manipulate. In addition, we developed a machine learning system with some orthogonal features that can be combined with other feature sets in order to analyze emergent characteristics of spam in social networks. We also conducted an extensive evaluation, which shows that our system obtains an F-measure at the same level as the best state-of-the-art systems based on the detection of spam accounts. Our system can thus be applied to Twitter spam detection in trending topics in real time, mainly because it analyzes tweets instead of user accounts.

3.
With the rise of social networking services such as Facebook and Twitter, the problem of spam and content pollution has become more significant and intractable. Using social networking services, users are able to develop relationships and share messages with others in a very convenient manner; however, they are vulnerable to receiving spam messages. The automatic detection of spammers or content polluters on the network can effectively reduce the burden on the service provider in deciding on appropriate counteractions. Content polluters can be identified automatically with supervised learning. To build a highly accurate classification model automatically from the training data set, it is important to identify a set of useful features that can separate polluters from non-polluters. Moreover, because this process deals with a huge amount of raw data, the efficiency of data preparation and model creation is also a critical issue. In this paper, we present an efficient method for detecting content polluters on Twitter. Specifically, we propose a set of features that can be easily extracted from the messages and behaviors of Twitter users and construct a new breed of classifiers based on these features. The proposed approach requires only a minimal number of feature values per Twitter user and thus adds considerably less time to the overall mining process than other methods. Experiments confirm that the proposed approach outperforms previous approaches in both classification accuracy and processing time.

4.
Highly discriminative statistical features for email classification (cited 2 times: 2 self-citations, 0 citations by others)
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora, including four ground-truth standard corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmark corpora, and evidence that biased discriminant analysis in particular offers better discriminative features for classification, gives stable classification results regardless of the number of features chosen, and robustly retains its discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules for filtering emails are built from a limited number of features.

5.
Online social networks have become immensely popular in recent years and are now major sources for tracking the reverberation of events and news throughout the world. However, the diversity and popularity of online social networks attract malicious users who inject new forms of spam. Spamming is a malicious activity in which a fake user spreads unsolicited messages in the form of bulk messages, fraudulent reviews, malware or viruses, hate speech, profanity, or advertising for marketing scams. In addition, spammers usually form a connected community of spam accounts and use them to spread spam to a large set of legitimate users. Consequently, it is highly desirable to detect such spammer communities in social networks. Even though a significant amount of work has been done on detecting spam messages and accounts, little research has addressed spammer communities and hidden spam accounts. In this work, an unsupervised approach called SpamCom is proposed for detecting spammer communities in Twitter. We model the Twitter network as a multilayer social network and exploit overlapping community-based features of users, represented as hypergraphs, to identify spammers based on their structural behavior and URL characteristics. The use of community-based features, graph and URL characteristics of user accounts, and content similarity among users makes our technique very robust and efficient.

6.
Artificial immune system inspired behavior-based anti-spam filter (cited 2 times: 1 self-citation, 1 citation by others)
This paper proposes a novel behavior-based anti-spam technology for email service based on an artificial immune-inspired clustering algorithm. The suggested method is capable of continuously delivering the most relevant spam emails from the collection of all spam emails reported by the members of the network. Mail servers could implement the anti-spam technology by using the “black lists” that have already been recognized. Two main concepts are introduced: one defines the behavior-based characteristics of spam, and the other continuously identifies similar groups of spam while the spam streams are processed. Experimental results on real-world datasets show that the proposed technology is reliable, efficient and scalable. Since no single technology can achieve one hundred percent spam detection with zero false positives, the proposed method may be used in conjunction with other filtering systems to minimize errors.

7.
Social networks, once an innocuous platform for sharing pictures and thoughts among a small online community of friends, have now transformed into a powerful tool for information, activism, mobilization, and sometimes abuse. Detecting the true identity of social network users is an essential step toward making social media an efficient channel of communication. This paper targets the microblogging service Twitter as the social network of choice for investigation. It has been observed that the dissemination of pornographic content and the promotion of follower markets are active on Twitter, which clearly indicates loopholes in Twitter’s spam detection techniques. In this work, five types of spammers have been identified: sole spammers, pornographic users, follower market merchants, fake profiles, and compromised profiles. For detection, data from around 1 lakh (100,000) Twitter users and their 20 million tweets were collected. Users were classified on trust-, user- and content-based features using machine learning techniques such as Bayes Net, Logistic Regression, J48, Random Forest, and AdaBoostM1. The experimental results show that the Random Forest classifier predicts spammers with an accuracy of 92.1%. Based on these initial classification results, a novel system for real-time streaming of users for spam detection has been developed. We envision that such a system could indicate to Twitter users, in real time, the identity of the users they encounter.

8.
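Zhang Jian, Yan Ke, Ma Xiang. Journal of Computer Applications (《计算机应用》), 2022, 42(3): 770-777
Spam identification is one of the main tasks in natural language processing. Traditional methods are based on text features or word frequency; their accuracy depends mainly on whether specific keywords appear, so they perform poorly when keywords are misrecognized or when a spam text contains none of the keywords. A neural-network-based method is therefore proposed. First, traditional methods are trained and tested on this type of spam text; then, using spam SMS, advertisement and spam email data...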

9.
Social networking sites (SNS) are quickly becoming one of the most popular tools for social interaction and information exchange. Previous research has shown a relationship between users’ personality and SNS use. Using a general population sample (N = 300), this study furthers such investigations by examining the personality correlates (Neuroticism, Extraversion, Openness-to-Experience, Agreeableness, Conscientiousness, Sociability and Need-for-Cognition) of social and informational use of the two largest SNS: Facebook and Twitter. Age and Gender were also examined. Results showed that personality was related to online socialising and information seeking/exchange, though not as influential as some previous research has suggested. In addition, a preference for Facebook or Twitter was associated with differences in personality. The results reveal differential relationships between personality and Facebook and Twitter usage.

10.
This paper investigates the use of statistical dimensionality reduction (DR) techniques for discriminative low-dimensional embedding to enable affective movement recognition. Human movements are defined by a collection of sequential observations (time-series features) representing body joint angles or joint Cartesian trajectories. In this work, these sequential observations are modelled as temporal functions using B-spline basis function expansion, and dimensionality reduction techniques are adapted so that they can be applied to the functional observations. The DR techniques adapted here are Fisher discriminant analysis (FDA), supervised principal component analysis (PCA), and Isomap. These functional DR techniques, along with functional PCA, are applied to affective human movement datasets, and their performance is evaluated using leave-one-out cross-validation with a one-nearest-neighbour classifier in the corresponding low-dimensional subspaces. The results show that functional supervised PCA outperforms the other DR techniques examined in terms of classification accuracy and time requirements.
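The evaluation protocol named in this abstract, leave-one-out cross-validation with a one-nearest-neighbour classifier, can be sketched as follows. This is only an illustration under stated assumptions: the function name `loo_1nn_accuracy`, the Euclidean distance, and the toy 2-D points standing in for low-dimensional embeddings are not taken from the paper.

```python
import math


def loo_1nn_accuracy(points, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    Each point is held out in turn and classified by the label of its
    nearest remaining point (Euclidean distance, an assumption here).
    """
    if len(points) != len(labels) or len(points) < 2:
        raise ValueError("need at least two labeled points")
    correct = 0
    for i, (x, y) in enumerate(zip(points, labels)):
        # Search every point except the held-out one.
        candidates = (j for j in range(len(points)) if j != i)
        nearest = min(candidates, key=lambda j: math.dist(x, points[j]))
        if labels[nearest] == y:
            correct += 1
    return correct / len(points)


# Hypothetical 2-D embeddings of four movements from two affective classes.
embedded = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
classes = ["happy", "happy", "sad", "sad"]
print(loo_1nn_accuracy(embedded, classes))
```

Because a 1-NN classifier has no parameters to fit, its leave-one-out accuracy depends only on the geometry of the embedding, which makes it a common way to compare dimensionality reduction methods.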

11.
In this paper we investigate the use of a multimodal feature learning approach, using neural network based models such as Skip-gram and Denoising Autoencoders, to address sentiment analysis of micro-blogging content, such as Twitter short messages, that consists of a short text and, possibly, an image. The approach is motivated by recent advances in: i) training neural network language models, which have proved extremely efficient on web-scale text corpora and perform very well on syntactic and semantic word similarity; ii) unsupervised learning, with neural networks, of robust visual features that can be recovered from partial observations caused by occlusions or by noisy and heavily modified images. We propose a novel architecture that incorporates these neural networks, test it on several standard Twitter datasets, and show that the approach is efficient and obtains good classification results.

12.
Using the rational actor perspective as a guiding frame, this exploratory study examined individuals’ social media diet (i.e., amount, frequency, and duration of use) as a function of task load and expected goal attainment. Surveys were distributed (N = 337) focusing on Twitter and Facebook usage for informational and relational purposes, respectively. Increased task load, conceptualized as a cognitive cost, directly and negatively influenced Twitter use but only indirectly influenced Facebook use, as a function of perceived benefits. Across conditions, perceived self-efficacy was negatively associated with perceived task load and positively associated with goal attainment, and goal attainment was a significant correlate of increased social media usage. We interpret this to mean that a transparent technology such as Facebook has no cognitive costs associated with its use, while an opaque technology such as Twitter seems to have a salient cognitive cost element. Further, we found that older users of Facebook were more likely to judge the channel as more cognitively demanding and themselves as having lower self-efficacy in using it. Finally, results indicated that males perceived both Facebook and Twitter as more cognitively demanding than females did. Theoretical and practical explanations and applications for these findings are presented.

13.
14.
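Objective: Diabetic retinopathy (DR) is a complication of diabetes with a high incidence and a high rate of blindness. In clinical practice, misdiagnoses and missed diagnoses occur because retinal images of different grades differ only slightly and because clinicians vary in experience; manual DR diagnosis currently classifies poorly and is time-consuming and laborious. This paper therefore proposes an automatic DR image classification method that combines an attention mechanism with the high-efficiency network (EfficientNet), in order to diagnose lesion types accurately. Methods: To address the problems in the experimental DR dataset, the images are filtered, denoised, augmented and normalized. EfficientNet is used for feature extraction: following a transfer learning strategy, EfficientNet is trained on the DR dataset to extract deep features. To deal with the small differences between lesion grades and to prevent misclassification while the network learns features of diabetic retinal images, an attention mechanism is added to the EfficientNet output. The extracted features are then classified by a deep classifier, which assigns each retinal image to one of five grades. Results: The classification accuracy, sensitivity, specificity and quadratic weighted kappa of the proposed method are 97.2%, 95.6%, 98.7% and 0.84, respectively, indicating good classification performance and robustness. Conclusion: The DR classification algorithm based on the attention-fused high-efficiency network (attention EfficientNet, A-EfficientNet) effectively improves the efficiency of DR screening, overcomes the limitations of hand-crafted feature extraction in manual classification, and assists doctors in clinical diagnosis, helping to prevent and treat the serious visual impairment and even blindness caused by this severe eye disease.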
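The quadratic weighted kappa reported in this abstract rewards predictions that land close to the true grade on an ordinal scale. The sketch below shows the standard definition; it is not the authors' evaluation code, and the function name `quadratic_weighted_kappa` and the encoding of the five grades as integers 0 to 4 are assumptions.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for ordinal labels 0..n_classes-1.

    kappa = 1 - sum(w * observed) / sum(w * expected), where
    w[i][j] = (i - j)^2 / (n_classes - 1)^2 and `expected` is the
    agreement that the two label histograms would produce by chance.
    """
    if len(y_true) != len(y_pred) or not y_true or n_classes < 2:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    if weighted_expected == 0:
        # Both label lists use one and the same class; kappa is undefined.
        raise ValueError("kappa is undefined for a single shared class")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical grades for six retinal images (0 = no DR ... 4 = most severe).
truth = [0, 1, 2, 3, 4, 2]
preds = [0, 1, 2, 3, 3, 1]
print(quadratic_weighted_kappa(truth, preds, n_classes=5))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and errors between distant grades are penalized more heavily than errors between adjacent grades.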

15.

In this paper, after reviewing and comparing existing social network datasets, we introduce the SNEFL dataset: the first social network dataset that includes the level of users’ likes (fuzzy like) in addition to the likes between users. The data was collected from a social network with users’ privacy in mind. It includes several additional features, including the age, gender, marital status, height, weight, educational level and religiosity of the users. We describe its structure, analyse its features and evaluate its advantages in comparison with other social network datasets. In addition, using the unique feature of the SNEFL dataset (fuzzy like), a rule-based algorithm has been developed, for the first time, to detect involuntary celibates (Incels) in social networks. Despite Incel activity in online social networks, no computer science study has so far attempted to identify them. This study is the first step toward addressing this challenge facing society today. Experimental results show that the accuracy of the proposed algorithm in identifying Incels is 23.21% among all social network users and 68.75% among users who have fuzzy like data. Beyond Incel detection, the SNEFL dataset can be used by researchers in different fields to produce more accurate results. Areas in which it can be used include network analysis, frequent pattern mining, classification and clustering.


16.
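Objective: Diabetic retinopathy (DR) is currently one of the more serious blinding eye diseases, so automatic classification of diabetic retinal pathology images has important clinical value. Manual classification of retinal images suffers from difficulty in extracting discriminative features, poor classification performance, high time and labor costs, and difficulty in reaching an objective, uniform medical diagnosis. An automatic classification system for retinal pathology images based on a convolutional neural network and a classifier is therefore proposed. Methods: First, taking the characteristics of existing retinal images into account, the images are preprocessed by denoising, data augmentation and normalization. Second, starting from AlexNet, a batch normalization layer is inserted before every convolutional layer and every fully connected layer, yielding BNnet, a deep convolutional neural network with a more complex layer structure. BNnet serves as the feature extraction network for retinal images. It is trained with a transfer learning strategy: BNnet is pretrained on the ILSVRC2012 dataset, and the resulting model is then transferred to the retinal images and trained further to extract deep features for retinal classification. Finally, the extracted features are fed into a deep classifier composed of fully connected layers, which divides retinal images into five classes, including normal, mildly diseased and moderately diseased retinal images. Results: Experiments show that the proposed method reaches a classification accuracy of 0.93, outperforms conventional direct training, and has good robustness and generalization. Conclusion: The proposed retinal pathology image classification framework effectively avoids the limitations of manual feature extraction and image classification, and also addresses the overfitting caused by insufficient sample data.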
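The batch normalization layer that this abstract adds in front of each convolutional and fully connected layer can be illustrated by its training-time forward pass on a single feature. This is a minimal sketch of the generic operation, not the BNnet implementation; the function name `batch_norm` and the default values of `gamma`, `beta` and `eps` are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization of one feature over a mini-batch.

    Each value is shifted by the batch mean, divided by the batch standard
    deviation (eps avoids division by zero), then scaled by gamma and
    shifted by beta, which are learned parameters in a real network.
    """
    if not batch:
        raise ValueError("mini-batch must not be empty")
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    scale = gamma / math.sqrt(variance + eps)
    return [scale * (x - mean) + beta for x in batch]


# Hypothetical activations of one feature for a mini-batch of three images.
print(batch_norm([1.0, 2.0, 3.0]))
```

At inference time, frameworks normally replace the batch mean and variance with running averages collected during training, so the output for one image does not depend on the other images in the batch.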

17.
With recent advances in positioning and smartphone technologies, a number of social networks such as Twitter, Foursquare and Facebook are acquiring the dimension of location, thus bridging the gap between the physical world and online social networking services. Most location-based social networks have released check-in services that allow users to share the locations they visit with their friends. In this paper, users' interests are modeled by check-in actions. We propose a new type of Spatial-aware Interest Group (SIG) query that retrieves a user group of size k in which each user is interested in the query keywords and the users are close to each other in Euclidean space. We prove that the SIG query problem is NP-complete. A family of efficient algorithms based on the IR-tree is therefore proposed for processing SIG queries. Experiments on two real datasets show that our proposed algorithms achieve orders of magnitude improvement over the baseline algorithm.

18.
Liu Bo, Ni Zeyang, Luo Junzhou, Cao Jiuxin, Ni Xudong, Liu Benyuan, Fu Xinwen. World Wide Web, 2019, 22(6): 2953-2975

Social networking websites with microblogging functionality, such as Twitter or Sina Weibo, have emerged as popular platforms for discovering real-time information on the Web. Like most Internet services, these websites have become the targets of spam campaigns, which contaminate Web contents and damage user experiences. Spam campaigns have become a great threat to social network services. In this paper, we investigate crowd-retweeting spam in Sina Weibo, the counterpart of Twitter in China. We carefully analyze the characteristics of crowd-retweeting spammers in terms of their profile features, social relationships and retweeting behaviors. We find that although these spammers are likely to connect more closely than legitimate users, the underlying social connections of crowd-retweeting campaigns are different from those of other existing spam campaigns because of the unique features of retweets that are spread in a cascade. Based on these findings, we propose retweeting-aware link-based ranking algorithms to infer more suspicious accounts by using identified spammers as seeds. Our evaluation results show that our algorithms are more effective than other link-based strategies.


19.
During the last decade, online social networks such as Facebook™ (Facebook) grew rapidly in popularity, due in no small measure to the use of these media by adolescents. For many teenagers and young adults, Facebook represents a social institution that can be used not only for sharing basic information and connecting with others, but also as a platform for exploring and divulging information about their identities. To examine how late adolescents form and disclose identity-related information, this study investigates the relationship between their disclosures of intimate information on Facebook and their stage of psychosocial development. To examine the disclosure behaviors of young college students on Facebook, we conducted focus groups in conjunction with a content analysis of Facebook profiles. Findings point to an extended adolescence period resting on the identity construction dilemma posed by digital social networks.

20.
Recently, we have observed a great movement of Internet users from blogs and wikis to Myspace, Facebook, Twitter, and Plurk. To increase profit margins, it is crucial for website enterprises to identify the service factors that attract customers to join a new virtual community. Most works focus on social networks, success factors, and reasons for participating; relatively few have attempted to address this issue. Additionally, feature selection techniques have been widely used to extract crucial attributes. However, selecting meaningful features requires domain knowledge, which is hard to obtain. Therefore, this work proposes a Kano's model based Neural Network (KANN) method, which combines Kano's model and neural networks to identify the key service factors. Finally, a real case study based on a survey of virtual community users is provided to identify the key service factors.
