FITLayout Web Page Analysis Framework

FITLayout has evolved!

See the new FITLayout/2 on GitHub

The web page below refers to an outdated version of FITLayout. The new version includes a completely reworked Java API, a web GUI and much more. The complete documentation is available on GitHub.

FitLayout is an extensible web page segmentation and analysis framework written in Java. It defines a generic Java API for representing a rendered web page and its division into visual areas, and it provides a base for implementing page segmentation algorithms behind a common application interface. As a sample segmentation method, it implements a previously published segmentation algorithm based on recursive visual area merging and separator detection. The framework includes tools for post-processing the segmentation result with different text or visual classification methods. Finally, it provides tools for controlling the segmentation process and examining the segmentation results through a graphical user interface. The segmentation result may be stored as RDF data for later analysis.

The FitLayout framework consists of the following modules (hosted on GitHub): the API, the CSSBox bindings, segmentation, classification, and tools.

The API provides the common interface that connects all the remaining modules. Any of the remaining modules can be replaced or extended with a custom implementation. This makes it possible to replace the default CSSBox rendering engine with another implementation and, above all, to implement and test custom page segmentation methods.

FitLayout API

See also the FITLayout manual and the generated Javadoc documentation.

The API is based on an ontological description of the processed page as published in [1]. The related ontology is displayed below. For each class from the ontology, the API defines a Java interface with the corresponding properties. These interfaces are available in the org.fit.layout.model package.

API ontologies: A) box ontology B) visual area ontology C) tagging

Moreover, default implementations of the model interfaces are available in the org.fit.layout.impl package.
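
To illustrate the pattern of one Java interface per ontology class plus a default implementation, here is a minimal sketch. All type names in it (SketchRect, SketchBox, SketchArea and the Default* classes) are hypothetical and chosen for this example only; they are not the actual contents of org.fit.layout.model or org.fit.layout.impl, whose real signatures are given in the Javadoc.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration; NOT the real org.fit.layout.model types.

/** A rectangular region of the rendered page, in pixels. */
interface SketchRect {
    int getX1();
    int getY1();
    int getX2();
    int getY2();
}

/** One rendered piece of content (a class of the box ontology). */
interface SketchBox extends SketchRect {
    String getText();
}

/** A visual area that groups boxes (a class of the visual area ontology). */
interface SketchArea extends SketchRect {
    List<SketchBox> getBoxes();
}

/** Default box implementation: an immutable value holder. */
class DefaultSketchBox implements SketchBox {
    private final int x1, y1, x2, y2;
    private final String text;

    DefaultSketchBox(int x1, int y1, int x2, int y2, String text) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        this.text = text;
    }

    public int getX1() { return x1; }
    public int getY1() { return y1; }
    public int getX2() { return x2; }
    public int getY2() { return y2; }
    public String getText() { return text; }
}

/** Default area implementation: its bounds are the union of its boxes. */
class DefaultSketchArea implements SketchArea {
    private final List<SketchBox> boxes = new ArrayList<>();

    void addBox(SketchBox box) { boxes.add(box); }

    public List<SketchBox> getBoxes() { return Collections.unmodifiableList(boxes); }

    // An empty area reports 0 for all coordinates.
    public int getX1() {
        if (boxes.isEmpty()) return 0;
        int v = Integer.MAX_VALUE;
        for (SketchBox b : boxes) v = Math.min(v, b.getX1());
        return v;
    }
    public int getY1() {
        if (boxes.isEmpty()) return 0;
        int v = Integer.MAX_VALUE;
        for (SketchBox b : boxes) v = Math.min(v, b.getY1());
        return v;
    }
    public int getX2() {
        int v = 0;
        for (SketchBox b : boxes) v = Math.max(v, b.getX2());
        return v;
    }
    public int getY2() {
        int v = 0;
        for (SketchBox b : boxes) v = Math.max(v, b.getY2());
        return v;
    }
}
```

Code written against the interfaces only (SketchBox, SketchArea) keeps working when the default classes are swapped for another implementation, which is the point of separating the model from its default implementation.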

CSSBox bindings

CSSBox is used as the default rendering engine. The CSSBox bindings provide an implementation of the box tree source. Based on a given URL, the rendering engine is used to create a set of boxes that correspond to the individual pieces of document content positioned in the page.

Segmentation

The segmentation module provides a generic framework for implementing web page segmentation algorithms. It represents the page as a tree of detected visual areas and defines an interface for implementing custom tree processing methods (operators) that implement the actual segmentation.
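
The idea of an operator that processes the area tree can be sketched as follows. This is an assumption-laden illustration, not the framework's API: AreaNode, TreeOperator and CollapseSingleChild are hypothetical names introduced here, and the real operator interface may differ in both name and signature.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node; not the framework's Area type.
class AreaNode {
    final String name;
    final List<AreaNode> children = new ArrayList<>();

    AreaNode(String name) { this.name = name; }

    // Returns this node so that small trees can be built fluently.
    AreaNode add(AreaNode child) {
        children.add(child);
        return this;
    }
}

// Hypothetical operator contract: transforms the tree in place.
interface TreeOperator {
    void apply(AreaNode root);
}

// Example operator: every non-root area with exactly one child is
// replaced by that child, which removes redundant wrapper areas.
class CollapseSingleChild implements TreeOperator {
    public void apply(AreaNode node) {
        for (int i = 0; i < node.children.size(); i++) {
            AreaNode child = node.children.get(i);
            apply(child); // process the subtree first (post-order)
            if (child.children.size() == 1) {
                node.children.set(i, child.children.get(0));
            }
        }
    }
}

class OperatorSketch {
    // Applies the operators one after another to the same tree.
    static void runAll(AreaNode root, List<TreeOperator> operators) {
        for (TreeOperator op : operators) {
            op.apply(root);
        }
    }

    public static void main(String[] args) {
        AreaNode root = new AreaNode("page")
            .add(new AreaNode("wrapper").add(new AreaNode("inner").add(new AreaNode("text"))))
            .add(new AreaNode("footer"));
        List<TreeOperator> ops = new ArrayList<>();
        ops.add(new CollapseSingleChild());
        runAll(root, ops);
        for (AreaNode child : root.children) {
            System.out.println(child.name);
        }
    }
}
```

Keeping each transformation behind one small interface lets a segmentation method be assembled as an ordered list of operators, each of which can be tested on its own.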

The default page segmentation algorithm is based on [2]. It uses a bottom-up approach that merges the individual boxes into larger visual areas. However, the FITLayout framework is open to adding other segmentation algorithms.
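
The bottom-up principle can be illustrated with a deliberately simplified sketch: rectangles that lie within a distance threshold of each other are merged repeatedly until nothing more can be merged. This is not the published algorithm (it builds no area tree and performs no separator detection), and the names Rect, BottomUpMerge and the maxGap parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical rectangle type used only in this sketch.
class Rect {
    final int x1, y1, x2, y2;

    Rect(int x1, int y1, int x2, int y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }

    // Smallest rectangle that covers both this one and the other one.
    Rect union(Rect o) {
        return new Rect(Math.min(x1, o.x1), Math.min(y1, o.y1),
                        Math.max(x2, o.x2), Math.max(y2, o.y2));
    }

    // Larger of the horizontal and vertical gaps; 0 when the rectangles touch or overlap.
    int gapTo(Rect o) {
        int dx = Math.max(0, Math.max(o.x1 - x2, x1 - o.x2));
        int dy = Math.max(0, Math.max(o.y1 - y2, y1 - o.y2));
        return Math.max(dx, dy);
    }
}

class BottomUpMerge {
    // Merges any two rectangles whose gap is at most maxGap, and repeats
    // until no such pair is left. Each merge removes one rectangle, so
    // the loop always terminates.
    static List<Rect> merge(List<Rect> boxes, int maxGap) {
        List<Rect> areas = new ArrayList<>(boxes);
        boolean merged = true;
        while (merged) {
            merged = false;
            search:
            for (int i = 0; i < areas.size(); i++) {
                for (int j = i + 1; j < areas.size(); j++) {
                    if (areas.get(i).gapTo(areas.get(j)) <= maxGap) {
                        Rect u = areas.get(i).union(areas.get(j));
                        areas.remove(j);
                        areas.set(i, u);
                        merged = true;
                        break search; // restart the scan with the new area
                    }
                }
            }
        }
        return areas;
    }

    public static void main(String[] args) {
        List<Rect> boxes = Arrays.asList(
            new Rect(0, 0, 40, 10),      // two words on one line
            new Rect(45, 0, 90, 10),
            new Rect(0, 100, 50, 110));  // a distant block
        System.out.println(merge(boxes, 10).size() + " areas");
    }
}
```

A single distance threshold is the crudest possible merging criterion; it is used here only to show how small boxes grow into larger areas.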

The result of the page segmentation is a tree of visual areas represented as Area objects.

Classification

The individual visual areas detected during the page segmentation may be tagged with arbitrary tags. The classification package implements tagging the areas by classification of their visual properties as proposed in [3] or textual properties [4].
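
As a rough sketch of how tagging by visual and by textual properties might look, the example below assigns tags with two hand-written rules. The rules are simple stand-ins for the classifiers proposed in the cited papers, and every name in the code (TaggedArea, AreaTagger, HeadingTagger, DateTagger, the tag names and the font size threshold) is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical area with one visual and one textual property.
class TaggedArea {
    final String text;
    final float fontSize;
    final Set<String> tags = new LinkedHashSet<>();

    TaggedArea(String text, float fontSize) {
        this.text = text;
        this.fontSize = fontSize;
    }
}

// Hypothetical tagger contract: decides whether its tag applies to an area.
interface AreaTagger {
    String tagName();
    boolean matches(TaggedArea area);
}

// Visual rule: text at or above a font size threshold is tagged as a heading.
class HeadingTagger implements AreaTagger {
    private final float minFontSize;

    HeadingTagger(float minFontSize) { this.minFontSize = minFontSize; }

    public String tagName() { return "heading"; }
    public boolean matches(TaggedArea area) { return area.fontSize >= minFontSize; }
}

// Textual rule: text that contains a date such as 12.3.2015 is tagged as a date.
class DateTagger implements AreaTagger {
    private static final Pattern DATE =
        Pattern.compile("\\b\\d{1,2}\\.\\s?\\d{1,2}\\.\\s?\\d{4}\\b");

    public String tagName() { return "date"; }
    public boolean matches(TaggedArea area) { return DATE.matcher(area.text).find(); }
}

class TaggingSketch {
    // Adds the tag of every matching tagger; an area may receive several tags.
    static void tagAll(List<TaggedArea> areas, List<AreaTagger> taggers) {
        for (TaggedArea area : areas) {
            for (AreaTagger tagger : taggers) {
                if (tagger.matches(area)) {
                    area.tags.add(tagger.tagName());
                }
            }
        }
    }

    public static void main(String[] args) {
        List<TaggedArea> areas = Arrays.asList(
            new TaggedArea("Conference News", 24f),
            new TaggedArea("Published 12.3.2015", 12f));
        tagAll(areas, Arrays.<AreaTagger>asList(new HeadingTagger(18f), new DateTagger()));
        for (TaggedArea area : areas) {
            System.out.println(area.text + " -> " + area.tags);
        }
    }
}
```

Because tags are kept as a set on each area, visual and textual taggers can run independently and their results combine without conflict.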

Tools

The tools make it possible to run the segmentation on a given URL and to examine the results in a GUI browser.

Visual area browser GUI

References

  1. MILIČKA Martin and BURGET Radek. Information Extraction from Web Sources based on Multi-aspect Content Analysis. In: Semantic Web Evaluation Challenges, SemWebEval 2015 at ESWC 2015. Portorož: Springer International Publishing, 2015, pp. 81-92. ISBN 978-3-319-25517-0. ISSN 1865-0929.
  2. MILIČKA Martin and BURGET Radek. Web Document Description Based on Ontologies. In: Proceedings of the 2nd annual conference ICIA 2013. Łódź: The Society of Digital Information and Wireless Communications, 2013, pp. 288-293. ISBN 978-1-4673-5255-0.
  3. BURGET Radek. Layout Based Information Extraction from HTML Documents. In: 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007, pp. 624-629. ISBN 0-7695-2822-8.
  4. BURGET Radek and RUDOLFOVÁ Ivana. Web Page Element Classification Based on Visual Features. In: 1st Asian Conference on Intelligent Information and Database Systems ACIIDS 2009. Dong Hoi: IEEE Computer Society, 2009, pp. 67-72. ISBN 978-0-7695-3580-7.
  5. BURGET Radek and SMRŽ Pavel. Extracting Visually Presented Element Relationships from Web Documents. International Journal of Cognitive Informatics and Natural Intelligence. Hershey: IGI Global, 2013, vol. 2013, no. 2, pp. 13-29. ISSN 1557-3958.

License

FITLayout is available under the terms of the GNU General Public License.

Acknowledgements

This work was supported by the BUT FIT grant FIT-S-14-2299 and the IT4Innovations Centre of Excellence CZ.1.05/1.1.00/02.0070.