An Extraction Algorithm of Chinese HTML Content Based on Similarity (基于相似度的中文网页正文提取算法)

Cite this article:	XIONG Zi-qi, ZHANG Hui, LIN Mao-song. An Extraction Algorithm of Chinese HTML Content Based on Similarity [J]. 四川建材学院学报, 2010(1): 80-84.

Author names:	熊子奇 (XIONG Zi-qi), 张晖 (ZHANG Hui), 林茂松 (LIN Mao-song)

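Author affiliation:	School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, Sichuan, China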

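Funding:	Start-up Fund for Returned Overseas Scholars of the Ministry of Personnel of China (07ZD0105); Start-up Fund for Returned Overseas Scholars of Southwest University of Science and Technology (07ZX0102).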

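Abstract:	Web page content extraction is an important step in Web mining. Traditional extraction methods must first segment a page into blocks and then identify the main content block. This paper proposes a method that extracts the main content by combining the content similarity and the tag similarity between text lines. The algorithm avoids the block segmentation step of traditional extraction algorithms: after normalizing the page, it first extracts the largest text line, then computes the content similarity and tag similarity between each text line and that largest line, and finally combines the two similarities to extract the main content. In experiments on randomly sampled web pages, the test accuracy was close to 95%, which indicates that the algorithm is effective in practice.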
Keywords:	content similarity; tag similarity; blocking; text mining
An Extraction Algorithm of Chinese HTML Content Based on Similarity

Authors:	XIONG Zi-qi, ZHANG Hui, LIN Mao-song

Affiliation:	School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang 621010, Sichuan, China

Abstract:	Web page content extraction is an important step in Web mining. This paper proposes a new extraction method that combines the content similarity and the tag similarity of text lines. The approach avoids the page-blocking step that traditional methods require. It first extracts the largest text line, then computes the content similarity and tag similarity between each line and that largest line, and finally combines the two similarities to extract the page content. In experiments on a set of collected web pages, the accuracy of the approach is close to 95%, which shows that the method is effective in practice.

Keywords:	Text similarity; Tag similarity; Blocking; Text mining
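
The pipeline described in the abstract (find the largest text line, score every other line against it by content and by tags, combine the two scores) can be illustrated with a minimal Python sketch. The abstract does not state how either similarity is computed or how they are combined, so everything concrete here is an assumption made for illustration: Jaccard overlap of character bigrams for content similarity, Jaccard overlap of opening tag names for tag similarity, the weight `alpha`, the cutoff `threshold`, and all function names. The sketch also assumes the page has already been normalized so that each source line holds one block of markup.

```python
import re

# Opening tags only: "</p>" does not match because "/" is not a letter.
OPEN_TAG_RE = re.compile(r"<\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
ANY_TAG_RE = re.compile(r"<[^>]+>")


def split_line(html_line):
    """Return (visible text, set of opening tag names) for one HTML line."""
    tags = {name.lower() for name in OPEN_TAG_RE.findall(html_line)}
    text = ANY_TAG_RE.sub("", html_line).strip()
    return text, tags


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when there is nothing to compare."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def bigrams(text):
    """Character bigrams, which need no word segmentation for Chinese text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def extract_content(html, alpha=0.5, threshold=0.3):
    """Keep the lines that resemble the largest text line.

    alpha weights content similarity against tag similarity and
    threshold is the cutoff on the combined score; both values are
    hypothetical and not taken from the paper.
    """
    lines = [split_line(raw) for raw in html.splitlines()]
    lines = [(text, tags) for text, tags in lines if text]
    if not lines:
        return []

    # The largest text line serves as the reference for both similarities.
    ref_text, ref_tags = max(lines, key=lambda item: len(item[0]))
    ref_grams = bigrams(ref_text)

    kept = []
    for text, tags in lines:
        content_sim = jaccard(bigrams(text), ref_grams)
        tag_sim = jaccard(tags, ref_tags)
        score = alpha * content_sim + (1 - alpha) * tag_sim
        if score >= threshold:
            kept.append(text)
    return kept
```

Calling `extract_content(page)` on a page whose body paragraphs sit in `<p>` lines and whose navigation sits in `<div><a>` lines keeps the paragraphs and drops the navigation, because the navigation lines share neither tags nor much text with the largest line. With real pages the weight and threshold would need tuning.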
This article is indexed in VIP (维普) and other databases.
