Interactive learning of node selecting tree transducer |
| |
Authors: | Julien Carme Rémi Gilleron Aurélien Lemay Joachim Niehren |
| |
Affiliation: | (1) Mostrare project, INRIA Futurs and Lille University, LIFL, Lille |
| |
Abstract: | We develop new algorithms for learning monadic node selection queries in unranked trees from annotated examples, and apply
them to visually interactive Web information extraction.
We propose to represent monadic queries by bottom-up deterministic Node Selecting Tree Transducers (NSTTs), a particular class
of tree automata that we introduce. We prove that deterministic NSTTs capture the class of queries definable in monadic second
order logic (MSO) in trees, which Gottlob and Koch (2002) argue to have the right expressiveness for Web information extraction,
and prove that monadic queries defined by NSTTs can be answered efficiently. We present a new polynomial time algorithm in
RPNI-style that learns monadic queries defined by deterministic NSTTs from completely annotated examples, where all selected nodes
are distinguished.
In practice, users prefer to provide partial annotations. We propose to account for partial annotations by intelligent tree
pruning heuristics. We introduce pruning NSTTs—a formalism that shares many advantages of NSTTs. This leads us to an interactive learning algorithm for monadic queries
defined by pruning NSTTs, which satisfies a new formal active learning model in the style of Angluin (1987).
We have implemented our interactive learning algorithm integrated it into a visually interactive Web information extraction
system—called SQUIRREL—by plugging it into the Mozilla Web browser. Experiments on realistic Web documents confirm excellent quality with very few
user interactions during wrapper induction.
Editor: Georgios Paliouras and Yasubumi Sakakibara |
| |
Keywords: | Web information extraction Wrapper induction Grammatical inference Tree automata Monadic queries |
本文献已被 SpringerLink 等数据库收录! |
|