Automated analysis of images in documents for intelligent document search |
| |
Authors: | Xiaonan Lu, Saurabh Kataria, William J Brouwer, James Z Wang, Prasenjit Mitra, C Lee Giles |
| |
Affiliation: | (1) Department of Computer Science and Engineering, The Pennsylvania State University, University Park, USA; (2) College of Information Sciences and Technology, The Pennsylvania State University, University Park, USA; (3) Department of Chemistry, The Pennsylvania State University, University Park, USA |
| |
Abstract: | Authors use images to present a wide variety of important information in documents. For example, two-dimensional (2-D) plots
display important data in scientific publications. Often, end-users seek to extract this data and convert it into a machine-processible
form so that the data can be analyzed automatically or compared with other existing data. Existing document data extraction
tools are semi-automatic and require users to provide metadata and interactively extract the data. In this paper, we describe
a system that extracts data from documents fully automatically, completely eliminating the need for human intervention. The
system uses a supervised learning-based algorithm to classify figures in digital documents into five classes: photographs,
2-D plots, 3-D plots, diagrams, and others. Then, an integrated algorithm is used to extract numerical data from data points
and lines in the 2-D plot images along with the axes and their labels, the data symbols in the figure’s legend and their associated
labels. We demonstrate that the proposed system and its component algorithms are effective via an empirical evaluation. Our
data extraction system has the potential to be a vital component in high volume digital libraries. |
| |
Keywords: | Image; Document search; Figure; 2-D plot; Data extraction; Text block extraction |
This article is indexed in SpringerLink and other databases. |
|