Tamir Hassan, Robert Baumgartner:
Using Graph Matching Techniques to Wrap Data from PDF Documents.

In: Proceedings of the 15th International World Wide Web Conference (WWW2006), Edinburgh, Scotland (23rd - 26th May 2006), pp. 901-902, May 2006
Wrapping is the process of navigating a data source, semi-automatically extracting data and transforming it into a form suitable for data processing applications. There are currently a number of established products on the market for wrapping data from web pages. One such approach is Lixto [1], a product of research performed at our institute. Our work is concerned with extending the wrapping functionality of Lixto to PDF documents. As the PDF format is relatively unstructured, this is a challenging task. We have developed a method to segment the page into blocks, which are represented as nodes in a relational graph. This paper describes our current research in the use of relational matching techniques on this graph to locate wrapping instances.
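The idea of matching a small pattern against a graph of page blocks can be illustrated with a minimal sketch. This is not the authors' algorithm or any part of Lixto: the `Block` class, the `right` and `below` relations, the alignment tolerance and the brute-force search are all assumptions made for illustration, and the sample blocks are invented.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """A segmented page block (hypothetical representation)."""
    name: str
    text: str
    x: float  # left edge
    y: float  # top edge, growing downward


def build_graph(blocks, tol=5.0):
    """Return a set of (source, relation, target) edges between aligned blocks."""
    edges = set()
    for a in blocks:
        for b in blocks:
            if a is b:
                continue
            if abs(a.y - b.y) <= tol and b.x > a.x:
                edges.add((a.name, "right", b.name))  # b lies to the right of a
            if abs(a.x - b.x) <= tol and b.y > a.y:
                edges.add((a.name, "below", b.name))  # b lies below a
    return edges


def find_matches(blocks, edges, pattern_nodes, pattern_edges):
    """Brute-force search for assignments of blocks to pattern variables.

    pattern_nodes maps a variable name to a predicate on the block text;
    pattern_edges lists (variable, relation, variable) constraints.
    """
    variables = list(pattern_nodes)
    matches = []
    for combo in permutations(blocks, len(variables)):
        binding = dict(zip(variables, combo))
        if not all(pattern_nodes[v](binding[v].text) for v in variables):
            continue
        if all((binding[s].name, rel, binding[t].name) in edges
               for s, rel, t in pattern_edges):
            matches.append({v: binding[v].text for v in variables})
    return matches


def is_number(text):
    """True if the text parses as a decimal number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Invented sample page: a two-column price list with a header row.
blocks = [
    Block("b1", "Item", 50, 100), Block("b2", "Price", 200, 100),
    Block("b3", "Widget", 50, 120), Block("b4", "9.99", 200, 120),
    Block("b5", "Gadget", 50, 140), Block("b6", "19.50", 200, 140),
]
edges = build_graph(blocks)

# Pattern: any block with a numeric block to its right on the same row.
matches = find_matches(
    blocks,
    edges,
    pattern_nodes={"label": lambda t: True, "price": is_number},
    pattern_edges=[("label", "right", "price")],
)
```

Each match corresponds to one candidate wrapping instance. The exhaustive search over permutations is only practical for very small patterns; a real system would need a more efficient matching strategy.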



