Method of Web News Text Extraction Based on Naive Bayes

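Cite this article:	LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin. Method of Web News Text Extraction Based on Naive Bayes[J]. Computer and Modernization, 2016, 0(1): 59.

Authors:	LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin

Funding:	Teaching Reform Project of Higher Education Institutions of Shanxi Province (J2014108); Shanxi Province Education Science "Eleventh Five-Year" Plan Project (GH-08072)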

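Abstract:	To address the problem of automatically extracting information from web pages, this paper proposes a method that splits a web page into blocks by tag and identifies the news body text among those blocks using naive Bayes theory. The method uses each block's tag information, text similarity, and text length as the feature attributes for machine learning. To strengthen the representational power of the tag attributes and reduce interference between related tags, the algorithm applies the χ² test to the correlation among tag attributes and between tag attributes and classes, and reduces the attribute set accordingly. During extraction, the posterior probabilities of both body and non-body blocks are taken into account in order to improve accuracy. Experimental results show that, with suitably chosen parameter values, the accuracy of news body extraction reaches 85%.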
Keywords:	naive Bayes; news web page; web page blocking; body text extraction; correlation test
Received:	2016-01-26
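
The χ²-based attribute reduction mentioned in the abstract can be illustrated with a minimal sketch. It assumes each tag attribute is binary (a block either carries the tag or does not) and tests it against the body/non-body class on a 2x2 contingency table. The function names, the counts, and the 0.05 significance level are assumptions made for illustration and are not taken from the paper.

```python
# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
# The significance level is an assumption, not a value from the paper.
CRITICAL_005_DF1 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: blocks with the tag that are body text
    b: blocks with the tag that are not body text
    c: blocks without the tag that are body text
    d: blocks without the tag that are not body text
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the table carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_tags(counts, threshold=CRITICAL_005_DF1):
    """Keep only tags whose presence is significantly related to the class."""
    return [tag for tag, table in counts.items()
            if chi_square_2x2(*table) > threshold]


# Hypothetical counts over 100 labelled blocks.
counts = {
    "p": (40, 10, 5, 45),      # strongly associated with body text
    "span": (12, 13, 33, 42),  # roughly independent of the class
}
selected = select_tags(counts)
```

The same statistic could also be computed for a pair of tag attributes, so that one of two strongly correlated tags can be dropped; how the paper combines the two checks is not stated in the abstract.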
Method of Web News Text Extraction Based on Naive Bayes

LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin. Method of Web News Text Extraction Based on Naive Bayes[J]. Computer and Modernization, 2016, 0(1): 59.

Authors:	LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin

Abstract:	To address automatic information extraction from Web news pages, this paper proposes a method that extracts the news text from webpage tag blocks using naive Bayes. The tag information, text similarity, and text length of each tag block serve as the machine-learning attributes. To improve the representativeness of the tag attributes and reduce interference between related tags, the algorithm applies the χ² test to the correlation among tag attributes and between tag attributes and categories, and reduces the number of attributes accordingly. To improve extraction accuracy, the probabilities of both news text and non-news text are considered. Experimental results show that, with appropriate parameter values, the accuracy of news text extraction reaches 85%.

Keywords:	naive Bayes; news webpage; webpage tag block; text extraction; correlation test
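
The classification step described in the abstract can be sketched as a small naive Bayes model over discretised block attributes. This is an illustrative sketch only: the class `BlockNaiveBayes`, the feature names and buckets, the Laplace smoothing, and the log-odds rule for weighing the body score against the non-body score are all assumptions, since the abstract does not give the paper's exact decision rule or parameters.

```python
import math
from collections import Counter, defaultdict


class BlockNaiveBayes:
    """Naive Bayes over categorical block features (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant
        self.class_counts = Counter()               # label -> number of blocks
        self.feature_counts = defaultdict(Counter)  # (label, name) -> value counts
        self.feature_values = defaultdict(set)      # name -> values seen in training

    def fit(self, blocks, labels):
        for features, label in zip(blocks, labels):
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def log_score(self, features, label):
        """Unnormalised log posterior: log P(label) + sum of log P(value | label)."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for name, value in features.items():
            count = self.feature_counts[(label, name)][value]
            n_values = len(self.feature_values[name])
            score += math.log((count + self.alpha)
                              / (self.class_counts[label] + self.alpha * n_values))
        return score

    def body_log_odds(self, features):
        """Positive when 'body' is more probable than 'other' for this block."""
        return self.log_score(features, "body") - self.log_score(features, "other")


# Hypothetical training blocks: tag name plus bucketed length and similarity.
train = [
    ({"tag": "p", "length": "long", "similarity": "high"}, "body"),
    ({"tag": "div", "length": "long", "similarity": "high"}, "body"),
    ({"tag": "p", "length": "long", "similarity": "low"}, "body"),
    ({"tag": "a", "length": "short", "similarity": "low"}, "other"),
    ({"tag": "li", "length": "short", "similarity": "low"}, "other"),
    ({"tag": "div", "length": "short", "similarity": "high"}, "other"),
]
model = BlockNaiveBayes().fit([f for f, _ in train], [l for _, l in train])

# Pick the block of a page whose body score most exceeds its non-body score.
page_blocks = [
    {"tag": "a", "length": "short", "similarity": "low"},
    {"tag": "p", "length": "long", "similarity": "high"},
]
best_block = max(page_blocks, key=model.body_log_odds)
```

Because the two scores share the same normalising constant, their difference equals the log posterior odds, so ranking blocks by it uses the body and non-body posteriors together rather than the body posterior alone.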