FITLayout Framework Manual

Radek Burget
burgetr@fit.vutbr.cz

Introduction

FITLayout is an extensible web page segmentation and analysis framework written in Java. It defines a generic Java API for representing a rendered web page, its division into visual areas, and the further analysis of those areas. It also provides a base for implementing page segmentation algorithms with a common application interface. The framework includes tools for post-processing the segmentation result using different text or visual classification methods, and tools for controlling the segmentation process and examining its results through a graphical user interface.

Architecture of the Framework

FITLayout operates on a rendered page represented by a box tree. The box tree is obtained by rendering the page and calculating the positions, fonts, colors and other visual features of the individual pieces of content (boxes). The box tree is the input of the page segmentation algorithms.

Page segmentation is the main task implemented in FITLayout. It analyzes the input box tree and produces a tree of visual areas that correspond to the visual blocks detected in the page. The resulting visual area tree may be further processed by area tree operators that represent independent post-processing steps of the segmentation. These steps may change the organization of the resulting tree of visual areas, e.g. by grouping several nodes into new areas.
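
The idea of independent post-processing steps can be illustrated with a minimal sketch. The Area class, the apply() method and the GroupChildren operator below are hypothetical stand-ins written for this example; they are not the FITLayout API, whose operator interface may look different.

```java
import java.util.ArrayList;
import java.util.List;

final class OperatorPipeline {
    // Hypothetical, simplified area node (not the FITLayout Area interface).
    static final class Area {
        final String name;
        final List<Area> children = new ArrayList<>();
        Area(String name) { this.name = name; }
    }

    // Assumed operator contract: each operator modifies the tree in place.
    interface AreaTreeOperator {
        void apply(Area root);
    }

    // Example operator: wraps all children of the root in one new area.
    static final class GroupChildren implements AreaTreeOperator {
        public void apply(Area root) {
            if (root.children.size() < 2) {
                return; // nothing to group
            }
            Area group = new Area("group");
            group.children.addAll(root.children);
            root.children.clear();
            root.children.add(group);
        }
    }

    // Runs the operators one after another on the same tree.
    static void run(Area root, List<AreaTreeOperator> operators) {
        for (AreaTreeOperator op : operators) {
            op.apply(root);
        }
    }
}
```

Because every operator receives the tree left by the previous one, the order of the operators in the list matters.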

The process of page rendering and segmentation may be controlled using a provided set of tools. These tools include a visual browser with a graphical user interface that can be used for configuring and executing the individual tasks. In addition, a scriptable processor allows the tasks to be run in batch mode using JavaScript.

Modules

The FITLayout framework consists of the following basic modules:

Additional modules are described later.

The API (cssbox-api) module provides a shared API common for all the remaining modules. It provides the following basic Java packages:

The details about the individual available interfaces are given in the appropriate sections below.

Services

The FITLayout architecture can be extended by creating plugins that provide new functionality such as new box tree sources (document renderers), segmentation algorithms, area tree post-processing operators or GUI extensions. The plugins use the standard Java framework for extensible applications.

The following types of services are recognized:

Each service is identified by a unique identifier returned by its getId() method. All services may accept input parameters. They implement the ParametrizedOperation interface, which makes it possible to obtain information about the required input parameters (their names and types) and to assign values to them.

For accessing the services, FITLayout provides a simple ServiceManager with static methods for locating the services of a given type.
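
The following sketch illustrates how services identified by getId() and configured through parameters might be registered and looked up. It assumes that the plugin mechanism is the JDK java.util.ServiceLoader. Apart from getId(), every name and signature here (getParamNames(), setParam(), ServiceRegistry, DemoOperator) is hypothetical and does not reproduce the real ParametrizedOperation or ServiceManager API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical stand-in: the method names are assumptions, not the real API.
interface ParametrizedOperation {
    String[] getParamNames();
    boolean setParam(String name, Object value);
}

// Hypothetical base type of all services; only getId() is named in the manual.
interface Service extends ParametrizedOperation {
    String getId();
}

// Illustrative registry, not the real ServiceManager.
final class ServiceRegistry {
    private final Map<String, Service> byId = new LinkedHashMap<>();

    // Registers every provider declared for Service in META-INF/services.
    void discover() {
        for (Service s : ServiceLoader.load(Service.class)) {
            register(s);
        }
    }

    void register(Service s) {
        byId.put(s.getId(), s);
    }

    // Returns the service with the given identifier, or null if unknown.
    Service find(String id) {
        return byId.get(id);
    }
}

// Example service with a single integer parameter.
final class DemoOperator implements Service {
    private int threshold = 0;

    public String getId() { return "demo.operator"; }

    public String[] getParamNames() { return new String[] {"threshold"}; }

    public boolean setParam(String name, Object value) {
        if ("threshold".equals(name) && value instanceof Integer) {
            threshold = (Integer) value;
            return true;
        }
        return false; // unknown parameter or wrong type
    }

    int getThreshold() { return threshold; }
}
```

A caller would typically look a service up by its identifier, for example find("demo.operator"), and then assign its parameters before running it.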

Box Tree

The whole rendered page is represented using a Page object. Its getRoot() method obtains a root node of the box tree that represents the page contents. The nodes of the box tree are formed by the Box objects that represent the individual rendered boxes. Each box has a fixed position in the rendered page obtained using the getBounds() method and some more visual properties such as font size, colors, etc. The related methods are defined by a shared ContentRect interface.

The getType() method obtains the box type which is one of the following:

The boxes are organized in a hierarchical structure. The getParentBox(), getChildBox() and getChildCount() methods may be used for traversing the hierarchy. The TEXT_CONTENT and REPLACED_CONTENT boxes are always the leaf nodes of the tree. The ELEMENT nodes may exist anywhere in the tree.
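
The traversal described above can be sketched as a simple recursive walk. The Box interface below is a minimal stand-in that contains only the methods needed here; the integer index parameter of getChildBox() and the SimpleBox helper class are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class BoxTreeWalk {
    enum Type { ELEMENT, TEXT_CONTENT, REPLACED_CONTENT }

    // Minimal stand-in for Box, limited to the methods this sketch needs.
    // The int index parameter of getChildBox is an assumption.
    interface Box {
        Type getType();
        int getChildCount();
        Box getChildBox(int index);
    }

    // In-memory box used only for the demonstration.
    static final class SimpleBox implements Box {
        private final Type type;
        private final List<Box> children = new ArrayList<>();

        SimpleBox(Type type, Box... kids) {
            this.type = type;
            for (Box k : kids) {
                children.add(k);
            }
        }

        public Type getType() { return type; }
        public int getChildCount() { return children.size(); }
        public Box getChildBox(int index) { return children.get(index); }
    }

    // Depth-first walk that counts the TEXT_CONTENT leaves below a box.
    static int countTextLeaves(Box box) {
        if (box.getType() == Type.TEXT_CONTENT) {
            return 1;
        }
        int count = 0;
        for (int i = 0; i < box.getChildCount(); i++) {
            count += countTextLeaves(box.getChildBox(i));
        }
        return count;
    }

    public static void main(String[] args) {
        Box page = new SimpleBox(Type.ELEMENT,
            new SimpleBox(Type.TEXT_CONTENT),
            new SimpleBox(Type.ELEMENT,
                new SimpleBox(Type.TEXT_CONTENT),
                new SimpleBox(Type.REPLACED_CONTENT)));
        System.out.println(countTextLeaves(page)); // prints 2
    }
}
```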

CSSBox Box Source

The default box source is implemented in the layout-cssbox (CSSBox bindings) module as the CSSBoxTreeProvider class. The individual boxes are represented using BoxNode objects. The CSSBox box source renders an input document identified by its URL. It supports HTML/CSS and PDF documents and is based on the open-source CSSBox rendering engine.

Segmentation

The segmentation algorithm takes a box tree as its input and produces a tree of visual areas. The resulting tree is represented by an AreaTree object. Its getRoot() method obtains the root node of the area tree that represents the segmentation result. Each node of the area tree is represented by an Area object that corresponds to a visual area detected in the page. The root node corresponds to the whole page area; the descendant nodes correspond to smaller detected areas. The leaf areas may contain the actual boxes from the box tree that represent the contents of the area.

The nodes provide basic tree navigation and manipulation methods similar to those of the box tree. All these functions are specified by a shared AreaTreeNode interface.

The position of the area in the rendered page and all its visual features such as fonts and colors may be obtained through the implemented ContentRect interface in the same way as for the individual boxes. However, since the contained boxes may have different visual properties (e.g. different font sizes), the corresponding methods of the visual area (such as getFontSize()) return the average values for the whole area.
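
As a small illustration of such an averaged property, the sketch below computes a plain arithmetic mean of the font sizes of the boxes contained in an area. This is an assumption made for the example: the manual does not say how the average is weighted, and the real implementation may, for instance, weight each box by the length of its text.

```java
import java.util.List;

final class AreaAverages {
    // Unweighted mean of the font sizes of the contained boxes.
    // Returns 0 for an area without boxes (a convention chosen for this sketch).
    static double averageFontSize(List<Double> boxFontSizes) {
        if (boxFontSizes.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double size : boxFontSizes) {
            sum += size;
        }
        return sum / boxFontSizes.size();
    }
}
```

For an area containing one box with font size 10 and one with font size 14, this sketch yields 12.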

Optionally, the mutual positions of the areas within their parent area may be described by an arbitrary topology. A typical example is a grid topology that represents the area positions using a flexible grid. The position of each area in the topology may be obtained using the getTopology() method and is represented using a generic AreaTopology interface.
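
One possible way to derive grid positions from pixel bounds is sketched below. It is an illustration only and makes a strong assumption: the grid columns and rows are defined by the distinct left and top edges of the child areas. The Rect class and the gridPositions() method are hypothetical and do not describe the actual AreaTopology implementation.

```java
import java.util.List;
import java.util.TreeSet;

final class GridSketch {
    // Bounds of a child area in page pixels (hypothetical helper type).
    static final class Rect {
        final int x1, y1, x2, y2;
        Rect(int x1, int y1, int x2, int y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    // Returns {column, row} for each area. The column is the rank of the
    // area's left edge among all distinct left edges; the row is the rank
    // of its top edge among all distinct top edges.
    static int[][] gridPositions(List<Rect> areas) {
        TreeSet<Integer> lefts = new TreeSet<>();
        TreeSet<Integer> tops = new TreeSet<>();
        for (Rect r : areas) {
            lefts.add(r.x1);
            tops.add(r.y1);
        }
        int[][] positions = new int[areas.size()][2];
        for (int i = 0; i < areas.size(); i++) {
            Rect r = areas.get(i);
            positions[i][0] = lefts.headSet(r.x1).size(); // edges strictly to the left
            positions[i][1] = tops.headSet(r.y1).size();  // edges strictly above
        }
        return positions;
    }
}
```

With this scheme, two areas placed side by side and a third one below them end up at grid positions (0, 0), (1, 0) and (0, 1), regardless of their exact pixel sizes.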

Default Extensible Segmentation Algorithm

The default segmentation algorithm implementation is contained in the segmentation module. It works in the following steps:

  1. The tree of basic visual areas is created. A basic visual area is the area formed by any box from the source box tree that is visually separated from its neighborhood. Generally, the following boxes are considered to be visually separated:
    • The root box.
    • The boxes that directly contain a text.
    • The boxes that have a background different from their neighborhood or have a visible border.
    For each visually separated box, a corresponding area is created in the tree of basic visual areas.
  2. The tree is processed by selected area tree operators. Several area tree operators are available for performing common segmentation tasks such as concatenating text lines or finding larger groups of boxes. See the implementations of the AreaTreeOperator interface for reference.
  3. The area nodes in the processed area tree are represented using a custom AreaImpl class.
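
The visual separation rules from step 1 can be sketched as a predicate over a simplified box model. The Box class below is hypothetical and carries only the properties the three rules need. As a further assumption, the "neighborhood" of a box is approximated by the background visible behind it, and the result is a flat list rather than the tree that the real algorithm builds.

```java
import java.util.ArrayList;
import java.util.List;

final class BasicAreas {
    // Hypothetical, simplified box: only what the three rules require.
    static final class Box {
        final String name;
        final boolean root;
        final boolean hasDirectText;
        final boolean hasBorder;
        final String background; // null means transparent
        final List<Box> children = new ArrayList<>();

        Box(String name, boolean root, boolean hasDirectText,
            boolean hasBorder, String background) {
            this.name = name;
            this.root = root;
            this.hasDirectText = hasDirectText;
            this.hasBorder = hasBorder;
            this.background = background;
        }

        Box add(Box child) {
            children.add(child);
            return this;
        }
    }

    // Applies the three rules: root box, direct text, or own background/border.
    static boolean isVisuallySeparated(Box box, String backgroundBehind) {
        if (box.root || box.hasDirectText || box.hasBorder) {
            return true;
        }
        return box.background != null && !box.background.equals(backgroundBehind);
    }

    // Collects the names of the separated boxes in document order.
    static void collect(Box box, String backgroundBehind, List<String> out) {
        if (isVisuallySeparated(box, backgroundBehind)) {
            out.add(box.name);
        }
        // Transparent boxes let the background behind them show through.
        String visible = box.background != null ? box.background : backgroundBehind;
        for (Box child : box.children) {
            collect(child, visible, out);
        }
    }
}
```

In this sketch, a transparent wrapper box without text or border produces no area of its own, while its text-bearing children still do.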

Tools

The tools module provides the tools for running and controlling the segmentation process: the Processor, which implements the whole segmentation process, and the BlockBrowser, which provides the graphical user interface.

Processor

The processor is a class responsible for executing the complete segmentation process, i.e. for creating the tree of basic visual areas and applying the configured operators to that tree. The basic functionality is defined in an abstract BaseProcessor class. There are two implementations available:

GUI Browser

The BlockBrowser implements the default browser with a Swing GUI. It lets the user choose the box tree provider and the area tree provider and configure the applied operators. For executing the segmentation, an instance of the GUIProcessor is used.