FITLayout is an extensible web page segmentation and analysis framework written in Java. It defines a generic Java API for representing a rendered web page, its division into visual areas, and the further analysis of those areas. It also provides a base for implementing page segmentation algorithms with a common application interface. The framework includes tools for post-processing the segmentation result with different text or visual classification methods, and tools for controlling the segmentation process and examining its results through a graphical user interface.
FITLayout operates on a rendered page represented by a box tree. The box tree is obtained by rendering the page and calculating the positions, fonts, colors and other visual features of the individual pieces of content (boxes). The box tree is the input of the page segmentation algorithms.
Page segmentation is the main task implemented in FITLayout. It analyzes the input box tree and produces a tree of visual areas that correspond to the detected visual blocks in the page. The created visual area tree may be further processed by area tree operators that represent independent post-processing steps of the segmentation. These steps may change the organization of the resulting tree of visual areas, for example by grouping several nodes into new areas.
The process of page rendering and segmentation may be controlled using a provided set of tools. These tools include a visual browser with a graphical user interface that can be used for configuring and executing the individual tasks. In addition, a scriptable processor allows the tasks to be run in batch mode using JavaScript.
The FITLayout framework consists of the following basic modules:
Additional modules are described later.
The API (cssbox-api) module provides an API shared by all the remaining modules. It provides the following basic Java packages:
org.fit.layout.model – basic Java interfaces used for representing the rendered page (a box tree) and the result of segmentation (an area tree).
org.fit.layout.impl – default implementations of the interfaces from the model package. These implementations may be used as a starting point for further extension in applications.
org.fit.layout.api – interfaces specific to the FITLayout framework itself. They include the services of different kinds as described below.
org.fit.layout.gui – common interfaces of a GUI browser used for monitoring the page processing.
The details about the individual available interfaces are given in the appropriate sections below.
The FITLayout architecture can be extended by creating new plugins that provide new functionality such as new box tree sources (document renderers), segmentation algorithms, area tree post-processing operators or GUI extensions. The plugins use the standard Java mechanism for creating extensible applications.
The following types of services are recognized:
BoxTreeProvider – a box tree source, i.e. the page renderer. Based on the input parameters (e.g. the page URL), it renders the page and produces the box tree.
AreaTreeProvider – an area tree source, i.e. a basic segmentation algorithm. It takes a box tree as its input and produces a visual area tree that represents the segmented page.
AreaTreeOperator – a post-processing operation applied to the visual area tree. It may perform any operation on the tree such as joining nodes, splitting nodes, extending the hierarchy, etc.
LogicalTreeProvider – an analyzer that takes the final area tree as its input and assigns semantics to selected areas (tree nodes).
Each service is identified by its unique identifier obtained using its getId() method.
All the services may accept input parameters. They implement the ParametrizedOperation interface, which allows a caller to obtain information about the required input parameters (their names and types) and to assign values to them.
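The idea of an operation that describes its own parameters can be illustrated with a minimal sketch. The class and method names below (DemoOperation, getParamTypes, setParam, getParam) and the parameters "url" and "width" are hypothetical stand-ins chosen for this example; they are not the actual ParametrizedOperation API, which may look different.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class ParameterSketch {

    /** Hypothetical stand-in for an operation with named, typed parameters. */
    static final class DemoOperation {
        private final Map<String, Class<?>> types = new LinkedHashMap<>();
        private final Map<String, Object> values = new LinkedHashMap<>();

        DemoOperation() {
            // Parameter names and types are made up for this example.
            types.put("url", String.class);
            types.put("width", Integer.class);
        }

        /** Returns the declared parameter names mapped to their types. */
        Map<String, Class<?>> getParamTypes() {
            return Collections.unmodifiableMap(types);
        }

        /** Assigns a value after checking the parameter name and type. */
        void setParam(String name, Object value) {
            Class<?> type = types.get(name);
            if (type == null) {
                throw new IllegalArgumentException("Unknown parameter: " + name);
            }
            if (!type.isInstance(value)) {
                throw new IllegalArgumentException(name + " expects " + type.getSimpleName());
            }
            values.put(name, value);
        }

        /** Returns the assigned value, or null if none was assigned. */
        Object getParam(String name) {
            return values.get(name);
        }
    }

    public static void main(String[] args) {
        DemoOperation op = new DemoOperation();
        // Discover the parameters first, then assign values to them.
        op.getParamTypes().forEach((name, type) ->
                System.out.println(name + " : " + type.getSimpleName()));
        op.setParam("url", "http://example.org/");
        op.setParam("width", 1200);
        System.out.println("width = " + op.getParam("width"));
    }
}
```

A generic client such as a GUI or a script can use this kind of self-description to build a configuration form or validate arguments without knowing the concrete service class.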
For accessing the services, FITLayout provides a simple ServiceManager with static methods for locating the services of the given types.
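Locating services by type and identifier can be sketched as follows. This is only an illustration under stated assumptions: the Registry class, its register and findByType methods, the Renderer and Operator interfaces and the identifiers are all hypothetical. Only the getId() method is taken from the description above, and the real ServiceManager may expose a different API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ServiceLookupSketch {

    /** Stand-in for a service; only getId() is taken from the text. */
    interface Service {
        String getId();
    }

    /** Hypothetical service kinds, loosely mirroring the types listed above. */
    interface Renderer extends Service { }
    interface Operator extends Service { }

    /** Hypothetical registry with static lookup methods. */
    static final class Registry {
        private static final Map<String, Service> SERVICES = new LinkedHashMap<>();

        static void register(Service service) {
            SERVICES.put(service.getId(), service);
        }

        /** Returns all registered services of the given type, keyed by ID. */
        static <T extends Service> Map<String, T> findByType(Class<T> type) {
            Map<String, T> result = new LinkedHashMap<>();
            for (Service s : SERVICES.values()) {
                if (type.isInstance(s)) {
                    result.put(s.getId(), type.cast(s));
                }
            }
            return result;
        }
    }

    // Small factories that create services with a fixed identifier.
    static Renderer renderer(String id) { return () -> id; }
    static Operator operator(String id) { return () -> id; }

    public static void main(String[] args) {
        Registry.register(renderer("demo.renderer"));
        Registry.register(operator("demo.join"));
        // Look up only the operators and print their identifiers.
        System.out.println(Registry.findByType(Operator.class).keySet());
    }
}
```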
The whole rendered page is represented by a Page object. Its getRoot() method returns the root node of the box tree that represents the page contents. The nodes of the box tree are Box objects that represent the individual rendered boxes. Each box has a fixed position in the rendered page, obtained using the getBounds() method, and further visual properties such as font size, colors, etc. The related methods are defined by a shared ContentRect interface.
The getType() method returns the box type, which is one of the following:
ELEMENT – a box generated by a DOM element
TEXT_CONTENT – a box representing a displayed text
REPLACED_CONTENT – a box representing a replaced content (an image or other object)
The boxes are organized in a hierarchical structure. The getParentBox(), getChildBox() and getChildCount() methods may be used for traversing the hierarchy. The TEXT_CONTENT and REPLACED_CONTENT boxes are always the leaf nodes of the tree. The ELEMENT nodes may exist anywhere in the tree.
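The traversal described above can be sketched as a simple depth-first walk. The Box interface and SimpleBox class below are simplified stand-ins written for this example, not the real org.fit.layout.model types. The method names follow the text, but the int index parameter of getChildBox() is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class BoxTreeWalkSketch {

    enum Type { ELEMENT, TEXT_CONTENT, REPLACED_CONTENT }

    /** Simplified stand-in exposing only the traversal methods named in the text. */
    interface Box {
        Type getType();
        int getChildCount();
        Box getChildBox(int index); // the index parameter is an assumption
    }

    /** Minimal in-memory box used to build a sample tree. */
    static final class SimpleBox implements Box {
        private final Type type;
        private final List<Box> children;

        SimpleBox(Type type, Box... children) {
            this.type = type;
            this.children = Arrays.asList(children);
        }

        public Type getType() { return type; }
        public int getChildCount() { return children.size(); }
        public Box getChildBox(int index) { return children.get(index); }
    }

    /** Counts the boxes of the given type in the subtree, depth first. */
    static int count(Box root, Type wanted) {
        int total = root.getType() == wanted ? 1 : 0;
        for (int i = 0; i < root.getChildCount(); i++) {
            total += count(root.getChildBox(i), wanted);
        }
        return total;
    }

    /** Builds a small tree: an element containing two texts, plus an image. */
    static Box sampleTree() {
        return new SimpleBox(Type.ELEMENT,
                new SimpleBox(Type.ELEMENT,
                        new SimpleBox(Type.TEXT_CONTENT),
                        new SimpleBox(Type.TEXT_CONTENT)),
                new SimpleBox(Type.REPLACED_CONTENT));
    }

    public static void main(String[] args) {
        Box root = sampleTree();
        for (Type type : Type.values()) {
            System.out.println(type + ": " + count(root, type));
        }
    }
}
```

In the sample tree only ELEMENT boxes have children, which matches the rule that TEXT_CONTENT and REPLACED_CONTENT boxes are always leaves.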
The default box source is implemented in the layout-cssbox (CSSBox bindings) module as the CSSBoxTreeProvider class. The individual boxes are represented by BoxNode objects. The CSSBox box source renders an input document identified by its URL. It supports HTML/CSS and PDF documents and is based on the open-source CSSBox rendering engine.
The segmentation algorithm takes a box tree as its input and produces a tree of visual areas. The resulting tree is represented by an AreaTree object. Its getRoot() method returns the root node of the area tree that represents the segmentation result. Each node of the area tree is an Area object that corresponds to a visual area detected in the page.
The root node corresponds to the whole page area, the descendant nodes correspond to smaller detected areas. The leaf
areas may contain the actual boxes from the box tree that represent the contents of the area.
The nodes provide basic tree navigation and manipulation methods similar to those of the box tree. All these functions are specified by a shared AreaTreeNode interface.
The position of the area in the rendered page and all its visual features such as fonts and colors may be obtained through the implemented ContentRect interface in the same way as for the individual boxes. However, since the contained boxes may have different visual properties (e.g. different font sizes), the corresponding methods for the visual area (such as getFontSize()) return the average values for the whole area.
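What "average values for the whole area" means can be illustrated with a small sketch. It assumes a plain arithmetic mean over the contained boxes. The actual implementation may compute the average differently (for example weighted by text length), so this is an illustration and not a description of the library's behavior. The ContentRect interface here is a one-method stand-in, and the float return type is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class AreaAverageSketch {

    /** One-method stand-in for the shared interface; the float type is assumed. */
    interface ContentRect {
        float getFontSize();
    }

    /** A leaf box with a fixed font size. */
    static ContentRect box(float fontSize) {
        return () -> fontSize;
    }

    /** Hypothetical area that averages the font sizes of its boxes. */
    static final class SimpleArea implements ContentRect {
        private final List<ContentRect> boxes;

        SimpleArea(ContentRect... boxes) {
            this.boxes = Arrays.asList(boxes);
        }

        /** Plain arithmetic mean; returns 0 for an empty area. */
        public float getFontSize() {
            if (boxes.isEmpty()) {
                return 0f;
            }
            float sum = 0f;
            for (ContentRect b : boxes) {
                sum += b.getFontSize();
            }
            return sum / boxes.size();
        }
    }

    public static void main(String[] args) {
        // Two 12-unit boxes and one 18-unit box average to 14.
        SimpleArea area = new SimpleArea(box(12f), box(12f), box(18f));
        System.out.println(area.getFontSize());
    }
}
```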
Optionally, the mutual positions of the areas within their parent area may be described by an arbitrary topology. A typical example is a grid topology that represents the area positions using a flexible grid. The position of each area in the topology may be obtained using the getTopology() method and is represented using a generic AreaTopology interface.
The default segmentation algorithm implementation is contained in the segmentation module. It works in the following steps:
AreaTreeOperator interface for a reference.
The area nodes in the processed area tree are represented using a custom AreaImpl class.
The tools module provides the tools for running and controlling the segmentation process: the Processor, which implements the whole segmentation process, and the BlockBrowser, which provides the graphical user interface.
The processor is a class responsible for executing the complete segmentation process, i.e. for creating the tree of basic visual areas and applying the configured operators to that tree. The basic functionality is defined in an abstract BaseProcessor class. There are two implementations available:
ScriptableProcessor uses JavaScript for configuring the area operators that should be applied.
GUIProcessor allows the configuration of the operators to be modified from outside (typically by the GUI browser).
The BlockBrowser implements the default browser with a Swing GUI. It lets the user choose the box tree provider and the area tree provider and configure the applied operators. For executing the segmentation, an instance of the GUIProcessor is used.
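The processing sequence that the processor is responsible for (create the basic area tree, then apply the configured operators in order) can be sketched generically. This is a simplified illustration, not the BaseProcessor implementation. The process method, the use of java.util.function types and the toy "trees" (plain lists of strings) are all assumptions made for the example. In particular, the operators here return a new value, whereas the real operators may modify the tree in place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.UnaryOperator;

class ProcessorPipelineSketch {

    /** Runs the provider once, then applies each operator in the given order. */
    static <B, A> A process(B boxTree,
                            Function<B, A> areaTreeProvider,
                            List<UnaryOperator<A>> operators) {
        A areaTree = areaTreeProvider.apply(boxTree);
        for (UnaryOperator<A> operator : operators) {
            areaTree = operator.apply(areaTree);
        }
        return areaTree;
    }

    /** Toy run: boxes and areas are just strings in a list. */
    static List<String> demo() {
        List<String> boxes = Arrays.asList("Title", "", "Body");

        // Toy "segmentation": one area per box.
        Function<List<String>, List<String>> provider = ArrayList::new;

        // Toy operator 1: drop areas without content.
        UnaryOperator<List<String>> dropEmpty = areas -> {
            List<String> kept = new ArrayList<>();
            for (String area : areas) {
                if (!area.isEmpty()) {
                    kept.add(area);
                }
            }
            return kept;
        };

        // Toy operator 2: group all remaining areas into a single area.
        UnaryOperator<List<String>> mergeAll =
                areas -> Collections.singletonList(String.join(" ", areas));

        List<UnaryOperator<List<String>>> operators = new ArrayList<>();
        operators.add(dropEmpty);
        operators.add(mergeAll);
        return process(boxes, provider, operators);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Title Body]
    }
}
```

The order of the operator list matters: swapping the two toy operators would merge the empty area into the result before it could be dropped.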