
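Focused topic Web crawler based on an improved TF-IDF algorithm
Cite this article: WANG Jingzhong, QIU Tongxiang. Focused topic Web crawler based on an improved TF-IDF algorithm[J]. Journal of Computer Applications, 2015, 35(10): 2901-2904. DOI: 10.11772/j.issn.1001-9081.2015.10.2901
Authors: WANG Jingzhong, QIU Tongxiang
Affiliation: School of Computer, North China University of Technology, Beijing 100144
Funding: National Natural Science Foundation of China (61371142); Beijing Municipal Innovation Team Building and Promotion Plan Project (IDHT20130502).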
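Abstract: Web retrieval results produced with the traditional TF-IDF algorithm, the K-means algorithm, and the adaptive genetic algorithm contain a large amount of irrelevant data, and their semantic retrieval accuracy is low. To address this, an improvement of the TF-IDF algorithm and its application to semantic retrieval were studied. The TF-IDF algorithm was improved by combining regular expressions with semantic analysis. The search topic was described using a semantic library, and the similarity of each regular atom within a document was obtained by a weighted calculation based on the importance of the atom's semantics and its position among the Web page tags. The final search results were obtained by computing the cosine between the document similarity and the topic model in the vector space model. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm, and the adaptive genetic algorithm were each applied to a focused topic Web crawler, and their retrieval results were compared. The results show that, in the vertical search of a focused topic Web crawler with semantic analysis, the retrieval accuracy of the improved TF-IDF algorithm was 17.1 percentage points higher than that of the traditional TF-IDF algorithm and its omission rate was 7.76 percentage points lower; its retrieval accuracy was 6 percentage points higher than that of the K-means algorithm and 8.1 percentage points higher than that of the adaptive genetic algorithm. In summary, the improved TF-IDF algorithm effectively raises the accuracy of document similarity detection and mitigates the shortcomings of focused topic Web crawlers in semantic analysis.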
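For orientation, the quantities named in the abstract can be written in their textbook forms. This is only a reference sketch: the abstract does not give the paper's improved formula, so the position-weighted term frequency in the last line, together with the symbols $w_p$ (weight of tag position $p$) and $n_p(t, d)$ (count of term $t$ at position $p$ in document $d$), is an assumption made for illustration.

```latex
% Textbook TF-IDF: N is the number of documents, df(t) the number containing term t.
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\,\log\frac{N}{\mathrm{df}(t)}

% Cosine similarity between a document vector d and a topic vector q.
\cos(\mathbf{d}, \mathbf{q}) = \frac{\mathbf{d}\cdot\mathbf{q}}{\lVert\mathbf{d}\rVert\,\lVert\mathbf{q}\rVert}

% Hypothetical position-weighted term frequency (not taken from the paper).
\mathrm{tf}_w(t, d) = \sum_{p} w_p\, n_p(t, d)
```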

Keywords: Web crawler; semantic analysis; search engine; TF-IDF; topic crawler; document similarity
Received: 2015-05-11
Revised: 2015-07-11

Focused topic Web crawler based on improved TF-IDF algorithm
WANG Jingzhong, QIU Tongxiang. Focused topic Web crawler based on improved TF-IDF algorithm[J]. Journal of Computer Applications, 2015, 35(10): 2901-2904. DOI: 10.11772/j.issn.1001-9081.2015.10.2901
Authors: WANG Jingzhong, QIU Tongxiang
Affiliation: School of Computer, North China University of Technology, Beijing 100144, China
Abstract: Web search results obtained with the traditional TF-IDF algorithm, the K-means algorithm, and the adaptive genetic algorithm contain a large amount of irrelevant data, and their semantic retrieval accuracy is low. To address this problem, an improvement of the TF-IDF algorithm and its application to semantic retrieval were studied. The TF-IDF algorithm was improved by combining regular expressions with semantic analysis. The search topic was described by a semantic database. The similarity of each regular atom in a document was obtained by a weighted calculation based on the semantic importance of the atom and its position among the Web page tags. The final search results were obtained by a cosine operation between the document similarity and the topic model in the vector space model. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm, and the adaptive genetic algorithm were each applied to a focused topic Web crawler, and their retrieval results were compared. The results show that, in the vertical search of the focused topic Web crawler, the accuracy of the improved TF-IDF algorithm was 17.1 percentage points higher than that of the traditional TF-IDF algorithm and its omission rate was 7.76 percentage points lower. Compared with the K-means algorithm and the adaptive genetic algorithm, the accuracy of the improved TF-IDF algorithm was higher by 6 percentage points and 8.1 percentage points, respectively. In summary, the improved TF-IDF algorithm effectively raises the accuracy of document similarity detection and mitigates the shortcomings of the focused topic Web crawler in semantic analysis.
Keywords: Web crawler; semantic analysis; search engine; Term Frequency-Inverse Document Frequency (TF-IDF); topic crawler; document similarity
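
The scoring pipeline outlined in the abstract (term weights that depend on tag position, followed by a cosine comparison against the topic) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the smoothed IDF variant, the function names, and the toy pages are all hypothetical, and the regular-expression and semantic-library steps are omitted.

```python
import math
from collections import Counter

# Hypothetical tag-position weights; the abstract does not state the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def weighted_tf(doc):
    """doc maps a tag name to a list of tokens; returns position-weighted term counts."""
    tf = Counter()
    for tag, tokens in doc.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags count as plain body text
        for token in tokens:
            tf[token] += weight
    return tf


def idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update({t for tokens in doc.values() for t in tokens})
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(doc, idf_table):
    """Sparse TF-IDF vector (term -> weight) for one page."""
    return {t: w * idf_table.get(t, 0.0) for t, w in weighted_tf(doc).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pages(topic_terms, pages):
    """Score each page against the topic terms; higher means more on-topic."""
    idf_table = idf(pages)
    topic_vec = {t: idf_table.get(t, 0.0) for t in topic_terms}
    return [cosine(topic_vec, tfidf_vector(p, idf_table)) for p in pages]


# Toy data, invented for illustration only.
pages = [
    {"title": ["web", "crawler"], "body": ["focused", "crawler", "topic"]},
    {"title": ["cooking"], "body": ["recipe", "crawler"]},
]
scores = rank_pages(["focused", "crawler", "topic"], pages)
print(scores)
```

With these toy pages the first page scores higher than the second, since it shares more terms with the topic and one of them appears in the more heavily weighted title position. A crawler could use such a score to decide which links to follow, though the threshold and the weights would need tuning on real data.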
This paper is indexed in Wanfang Data and other databases.
