Research on extraction model of malicious domain corpus based on context semantics

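Cite this article:	HUANG Cheng, LIU Jiayong, LIU Liang, HE Xiang, TANG Dianhua. Research on extraction model of malicious domain corpus based on context semantics[J]. Computer Engineering and Applications, 2018, 54(9): 101-108.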

Authors:	HUANG Cheng, LIU Jiayong, LIU Liang, HE Xiang, TANG Dianhua

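Affiliations:	1. College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China; 2. Science and Technology on Communication Security Laboratory, Chengdu 610041, China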

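Abstract:	To address the missed detections and false alarms that whitelist-filtering techniques currently produce when extracting malicious domains from massive text collections, this paper proposes a malicious domain corpus extraction model based on context semantics. The model performs semantic analysis on the context words and phrases of the sentences in which malicious domains appear, and uses natural language processing techniques to automatically generate a corpus that describes malicious domains. With this model, a large amount of malicious domain corpus data was extracted from publicly available APT (Advanced Persistent Threat) analysis documents. The effectiveness of the extracted malicious corpus was then verified using security blog article data together with a machine classification model based on the random forest algorithm.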
Keywords:	malicious domain; text mining; extraction model; malicious corpus
Research on extraction model of malicious domain corpus based on context semantics

HUANG Cheng, LIU Jiayong, LIU Liang, HE Xiang, TANG Dianhua. Research on extraction model of malicious domain corpus based on context semantics[J]. Computer Engineering and Applications, 2018, 54(9): 101-108.

Authors:	HUANG Cheng, LIU Jiayong, LIU Liang, HE Xiang, TANG Dianhua

Affiliation:	1. College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China; 2. Science and Technology on Communication Security Laboratory, Chengdu 610041, China

Abstract:	To address the false negatives and false positives that whitelist filtering produces when extracting malicious domains from massive text, a context-semantics-based model for extracting a malicious domain corpus is presented. The proposed approach analyzes the context words and phrases that describe malicious domains in technical writing, and uses natural language processing to automatically generate a corpus from the sentences that contain malicious domains. With the proposed model, a malicious domain corpus is generated from a large number of advanced persistent threat (APT) reports and articles. The malicious corpus extracted from these documents is verified with a random forest classifier.

Keywords:	malware detection; text mining; information extraction; malicious corpus
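
The core idea in the abstract, collecting the words that surround a malicious domain in the sentence that mentions it, can be illustrated with a small sketch. This is not the authors' model: the abstract does not give its details, so the domain pattern, the naive sentence splitter, the fixed token window, and the names `DOMAIN_RE`, `split_sentences`, `context_words`, and `build_corpus` are all assumptions made for illustration. The sketch covers only context words, not the phrase-level analysis or the random forest validation.

```python
import re
from collections import Counter

# Hypothetical, simplified pattern for domain-like tokens. "[.]" covers the
# defanged form often seen in threat reports. Not the paper's extractor.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?(?:\.|\[\.\]))+[a-z]{2,}\b",
    re.IGNORECASE,
)
WORD_RE = re.compile(r"[a-z]+", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_words(sentence, window=3):
    """Return (domain, nearby words) pairs for each domain-like token.

    Nearby words are the lowercase alphabetic words found in up to `window`
    whitespace-separated tokens on each side of the domain.
    """
    tokens = sentence.split()
    pairs = []
    for i, tok in enumerate(tokens):
        match = DOMAIN_RE.search(tok)
        if match is None:
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        words = [
            w.lower()
            for n in neighbours
            if DOMAIN_RE.search(n) is None  # skip other domains in the window
            for w in WORD_RE.findall(n)
        ]
        # Normalise the defanged "[.]" back to "." for the returned domain.
        pairs.append((match.group(0).replace("[.]", "."), words))
    return pairs


def build_corpus(text, window=3):
    """Count context words over all sentences that mention a domain."""
    counts = Counter()
    for sentence in split_sentences(text):
        for _domain, words in context_words(sentence, window):
            counts.update(words)
    return counts
```

Calling `build_corpus(report_text)` on the text of a threat report returns a `Counter` of words seen near domain-like tokens; sentences without such a token contribute nothing. A real pipeline would likely need a proper tokenizer and a list of valid top-level domains, since this pattern also matches file names such as `file.exe`.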
