
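Method of Web News Text Extraction Based on Naïve Bayes
Cite this article: LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin. Method of Web News Text Extraction Based on Naïve Bayes[J]. Computer and Modernization, 2016, 0(1): 59.
Authors: LUO Yong-lian  ZHAO Chang-yuan  JIA Yu-fang  LU Cai-lin
Funding: Shanxi Province Higher Education Teaching Reform Project (J2014108); Shanxi Province Education Science "Eleventh Five-Year" Plan Project (GH-08072)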
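Abstract: To address the problem of automatic information extraction from webpages, this paper proposes a method that divides a webpage into blocks by tag and identifies the news body text among those blocks using Naïve Bayes theory. The method uses each block's tag information, text similarity, and text length as the feature attributes for machine learning. To strengthen the representational power of the tag attributes and reduce interference between related tags, the algorithm uses the χ² test to examine the correlation among tag attributes and between tag attributes and classes, and reduces the attribute set accordingly. During extraction, the posterior probabilities of both body and non-body blocks are considered in order to improve accuracy. Experimental results show that, with appropriately chosen parameter values, the accuracy of news body text extraction reaches 85%.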
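The attribute reduction step can be illustrated with a small sketch. The abstract does not give the paper's exact procedure or thresholds, so the following assumes binary tag attributes (a tag is present in a block or not), a 2x2 contingency table per tag, and the standard 5% critical value for one degree of freedom; the function names and the example counts are hypothetical. It covers only the tag-versus-class test; the same statistic could in principle be applied to pairs of tags to detect redundant ones.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: blocks that contain the tag and are body text
    b: blocks that contain the tag and are not body text
    c: blocks that lack the tag and are body text
    d: blocks that lack the tag and are not body text
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) shows no association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
# Assumption: the paper's actual significance level is not stated in the abstract.
CRITICAL_95 = 3.841


def select_tags(tables, threshold=CRITICAL_95):
    """Keep only the tags whose association with the class exceeds the threshold."""
    return [tag for tag, cells in tables.items()
            if chi_square_2x2(*cells) > threshold]


if __name__ == "__main__":
    # Hypothetical counts: "p" separates the classes, "span" does not.
    tables = {"p": (20, 0, 0, 20), "span": (10, 10, 10, 10)}
    print(select_tags(tables))
```

A tag whose statistic falls below the threshold is treated as independent of the class and dropped, which is one plausible reading of "attribute reduction" here.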

Keywords: Naïve Bayes    news webpage    webpage blocking    body text extraction    correlation test
Received: 2016-01-26

Method of Web News Text Extraction Based on Naïve Bayes
LUO Yong-lian, ZHAO Chang-yuan, JIA Yu-fang, LU Cai-lin. Method of Web News Text Extraction Based on Naïve Bayes[J]. Computer and Modernization, 2016, 0(1): 59.
Authors: LUO Yong-lian  ZHAO Chang-yuan  JIA Yu-fang  LU Cai-lin
Abstract: To address the problem of automatically extracting information from Web news pages, this paper proposes a method that divides a webpage into tag blocks and identifies the news text among them using Naïve Bayes. The tag information, text similarity, and text length of each block are taken as the attributes for machine learning. To improve the representativeness of the tag attributes and reduce interference between related tags, the algorithm uses the χ² test to examine the correlation among tag attributes and between tag attributes and categories, and reduces the number of attributes accordingly. To improve extraction accuracy, the probabilities of both news text and non-news text are considered. The experimental results show that, with appropriate parameter values, the accuracy of news text extraction reaches 85%.
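As a rough illustration of classifying blocks while weighing both classes, the sketch below trains a Bernoulli-style Naïve Bayes model on discretized block features and accepts a block as news text only when its score for "body" exceeds its score for "other" by a margin. The feature names (`tag:p`, `len:long`, `sim:high`), the class labels, the Laplace smoothing, and the `margin` parameter are all assumptions made for this example; the abstract does not specify how the paper discretizes features or combines the two probabilities.

```python
import math
from collections import defaultdict


class BlockNaiveBayes:
    """Bernoulli Naive Bayes over sets of binary block features (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def fit(self, blocks, labels):
        for feats, label in zip(blocks, labels):
            self.class_counts[label] += 1
            for f in set(feats):
                self.feature_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def log_score(self, feats, label):
        """Unnormalized log posterior: log P(label) + sum of log P(feature state | label)."""
        total = sum(self.class_counts.values())
        n_c = self.class_counts[label]
        score = math.log(n_c / total)
        present = set(feats)
        for f in self.vocab:
            p = (self.feature_counts[label][f] + self.alpha) / (n_c + 2 * self.alpha)
            score += math.log(p if f in present else 1.0 - p)
        return score

    def is_body(self, feats, margin=0.0):
        # Compare both classes; `margin` is a hypothetical tuning parameter.
        return self.log_score(feats, "body") - self.log_score(feats, "other") > margin


if __name__ == "__main__":
    # Hypothetical training blocks; both classes must appear in the training data.
    blocks = [
        ["tag:p", "len:long"],
        ["tag:p", "len:long", "sim:high"],
        ["tag:a", "len:short"],
        ["tag:li", "tag:a", "len:short"],
    ]
    labels = ["body", "body", "other", "other"]
    model = BlockNaiveBayes().fit(blocks, labels)
    print(model.is_body(["tag:p", "len:long"]))
    print(model.is_body(["tag:a", "len:short"]))
```

Raising `margin` makes the classifier more conservative about labeling a block as news text, which is one way a tunable parameter could trade recall for precision.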
Keywords: Naïve Bayes  news webpage  webpage tag block  text extraction  correlation test