
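A Chinese Spam Filtering Method Based on N-gram Combination (一种基于N-gram组合的中文垃圾邮件过滤方法)

Cite as:	Liu Xinbin, Li Jun. 一种基于N-gram组合的中文垃圾邮件过滤方法 [A Chinese spam filtering method based on N-gram combination] [J]. 微电子学与计算机 (Microelectronics & Computer), 2004, 21(12): 85-91.

Authors:	Liu Xinbin, Li Jun

Affiliation:	Computer Network Information Center, Chinese Academy of Sciences, Beijing 100080, China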

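Abstract (translated from the Chinese):	The flood of Chinese spam creates an urgent need for a technical solution. This paper uses a filtering algorithm based on the naive Bayes model, applies N-grams to segment Chinese text automatically, and combines several N-grams to speed up the convergence of classification; classification of this kind is a practical way to filter spam. Because the classification is cost sensitive, it is handled by shifting the decision threshold. The results are evaluated with TCR together with spam recall (SR) and spam precision (SP). Experiments show that the method incurs a low cost while reaching high precision. Finally, we argue that screening the training mail and combining the filter with other measures can meet the requirements of settings such as ISP-level deployment.
Keywords:	spam filtering; N-gram; automatic Chinese text segmentation; ISP; algorithm; Bayesian model; TCR; precision; recall
Article ID:	1000-7180(2004)12-085-07
Revised:	7 July 2004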
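
The pipeline named in the abstract (N-gram segmentation, a naive Bayes model, several N-grams combined, and a shifted decision threshold) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the names `char_ngrams` and `NaiveBayesFilter`, the choice of N-gram sizes 1 and 2, the multinomial model with add-one smoothing, and the threshold `cost_ratio / (1 + cost_ratio)` are all assumptions made for the example. In particular, "combining" is read here as pooling the features of several N-gram sizes, which may differ from the scheme used in the paper.

```python
import math
from collections import Counter


def char_ngrams(text, sizes=(1, 2)):
    """Return overlapping character N-grams for every size in `sizes`."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in sizes:
        grams.extend("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))
    return grams


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes filter over pooled N-grams."""

    def __init__(self, sizes=(1, 2)):
        self.sizes = sizes
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Add one labelled message; `label` is "spam" or "ham"."""
        self.counts[label].update(char_ngrams(text, self.sizes))
        self.docs[label] += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) with add-one smoothing."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        total_docs = sum(self.docs.values())
        grams = char_ngrams(text, self.sizes)
        log_score = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            # Smoothed class prior, so an unseen class never yields log(0).
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for g in grams:
                score += math.log((self.counts[label][g] + 1) / (total + vocab))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))

    def is_spam(self, text, cost_ratio=9.0):
        """Flag as spam only when P(spam | text) exceeds a shifted threshold.

        `cost_ratio` (assumed) says how many times worse it is to block a
        legitimate mail than to let one spam through; a larger value moves
        the threshold closer to 1.
        """
        threshold = cost_ratio / (1.0 + cost_ratio)
        return self.spam_probability(text) > threshold
```

With `cost_ratio=1.0` the threshold is the usual 0.5; raising it makes the filter more reluctant to discard mail, which is the effect a cost-sensitive setting asks for.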
A Method Combined N-gram Based to Filter the Chinese Spam

LIU Xin-bin, LI Jun. A Method Combined N-gram Based to Filter the Chinese Spam [J]. Microelectronics & Computer, 2004, 21(12): 85-91.

Authors:	LIU Xin-bin, LI Jun

Abstract:	Mailboxes in China are now flooded with spam, which creates an urgent need for a technical countermeasure. Many studies indicate that text classification is a feasible approach. This paper proposes a filter based on the naive Bayesian algorithm and uses an N-gram method to segment Chinese text into words. Because the two kinds of misclassification carry asymmetric costs, measures are taken to handle this cost-sensitive classification problem. The results are evaluated with TCR (total cost ratio), SR (spam recall) and SP (spam precision). Experiments show that the proposed model achieves high accuracy at low cost. We conclude that carefully sifting the training mail corpus can further improve performance and so meet the requirements of ISP-level applications.

Keywords:	Anti-spam; Chinese email; Naive Bayesian model; N-gram; Cost sensitive
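
The three evaluation measures named in the abstract can be computed from a confusion count as sketched below. The abstract does not give the formulas, so this assumes the definitions commonly used in cost-sensitive spam filtering: SR is the share of spam that is caught, SP is the share of flagged mail that really is spam, and TCR is the number of spam messages divided by the cost-weighted number of errors. The function name `spam_metrics` and the parameter `cost_ratio` are hypothetical.

```python
def spam_metrics(spam_as_spam, spam_as_legit, legit_as_spam, cost_ratio=1.0):
    """Return (SR, SP, TCR) from confusion counts.

    spam_as_spam  : spam messages correctly flagged
    spam_as_legit : spam messages that slipped through
    legit_as_spam : legitimate messages wrongly flagged
    cost_ratio    : assumed relative cost of blocking a legitimate message
    """
    n_spam = spam_as_spam + spam_as_legit
    flagged = spam_as_spam + legit_as_spam

    sr = spam_as_spam / n_spam if n_spam else 0.0   # spam recall
    sp = spam_as_spam / flagged if flagged else 0.0  # spam precision

    # Cost of using the filter: each blocked legitimate mail counts
    # `cost_ratio` times, each missed spam counts once.
    weighted_errors = cost_ratio * legit_as_spam + spam_as_legit
    # Baseline cost of using no filter at all is n_spam (every spam gets through).
    tcr = n_spam / weighted_errors if weighted_errors else float("inf")
    return sr, sp, tcr
```

Under these definitions a TCR above 1 means the filter costs less than having no filter; the larger `cost_ratio` is, the harder that becomes to achieve.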
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据) and other databases.
