Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge
Authors: Srinivas Vadrevu, Fatih Gelgi, Hasan Davulcu
Affiliation: Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, USA
Abstract: The World Wide Web is transforming itself into the largest information resource, making information extraction (IE) from the Web an important and challenging problem. In this paper, we present an automated, domain-independent IE system that transforms a given Web page into a semi-structured hierarchical document using presentation regularities. The resulting documents are weakly annotated in the sense that they may contain many incorrect annotations and missing labels. We also describe how to improve the quality of weakly annotated data by using domain knowledge in the form of a statistical domain model. We demonstrate that such a system can recover from ambiguities in the presentation and boost the overall accuracy of a base information extractor by up to 20%. Our experimental evaluations with TAP data, computer science department Web sites, and RoadRunner document sets indicate that our algorithms can scale up to very large data sets.
Keywords: information extraction, web, page segmentation, grammar induction, pattern mining, semantic partitioner, metadata, domain knowledge, statistical domain model
This article is indexed in SpringerLink and other databases.