Ing. Radek Burget, Ph.D.
Burget, R.: Information Extraction from HTML Documents Based on Logical Document Structure, Brno, CZ, FIT VUT, 2004, 85 p.

| Publication language: | English |
|---|---|
| Publication title: | Information Extraction from HTML Documents Based on Logical Document Structure |
| Title (cs): | Extrakce informace z HTML dokumentů založená na logické struktuře dokumentu |
| Pages: | 85 |
| Place of publication: | Brno, CZ |
| Year: | 2004 |
| Publisher: | Faculty of Information Technology, Brno University of Technology |
| URL: | http://www.fit.vutbr.cz/~burgetr/thesis/burget_thesis.pdf [PDF] |
| URL: | http://www.fit.vutbr.cz/~burgetr/thesis/burget_thesis.ps [PS] |

| Keywords |
|---|
| Information Extraction, WWW, HTML, Logical Document Structure, Visual Information, Document Modeling |
| Abstract |
|---|
| The World Wide Web is the largest source of information on the Internet,
covering a broad range of areas. Web documents are mostly written in
the Hypertext Markup Language (HTML), which provides no means for
semantic description of the content, so the information they contain
cannot be processed directly. Current approaches to information
extraction from HTML are mostly based on wrappers that identify the
desired data in a document according to previously specified
properties of the HTML code. Such wrappers are limited to a narrow set
of documents and are very sensitive to any changes in the document
formatting.
In this thesis, we propose a novel approach to information extraction
that is based on modeling the visual appearance of the document. We
show that there are general rules for the visual presentation
of data in documents, and we define formal models of the visual
information contained in a document. Furthermore, we propose a way of
modeling the logical structure of an HTML document based on this visual
information. Finally, we propose methods that use the logical
structure model for the information extraction task, based on tree
matching algorithms. The advantage of this approach is a degree of
independence from the underlying HTML code and better resistance to
changes in the documents. |
| BibTeX: |
|---|
@PHDTHESIS{
author = {Radek Burget},
title = {Information Extraction from HTML Documents Based on Logical
Document Structure},
pages = {85},
year = {2004},
location = {Brno, CZ},
publisher = {Faculty of Information Technology BUT},
language = {english},
url = {http://www.fit.vutbr.cz/research/view_pub.php?id=7607}
}