Ing. Radek Burget, Ph.D.

Burget, R.: Information Extraction from HTML Documents Based on Logical Document Structure, Brno, CZ, FIT VUT, 2004, 85 p.
Publication language: English
Publication title: Information Extraction from HTML Documents Based on Logical Document Structure
Title (cs): Extrakce informace z HTML dokumentů založená na logické struktuře dokumentu
Pages: 85
Place of publication: Brno, CZ
Year: 2004
Publisher: Faculty of Information Technology, Brno University of Technology
URL: http://www.fit.vutbr.cz/~burgetr/thesis/burget_thesis.pdf [PDF]
URL: http://www.fit.vutbr.cz/~burgetr/thesis/burget_thesis.ps [PS]
Keywords
Information Extraction, WWW, HTML, Logical Document Structure, Visual Information, Document Modeling
Abstract
The World Wide Web is the largest source of information on the Internet and covers a broad range of areas. Web documents are mostly written in the Hypertext Markup Language (HTML), which provides no means for semantic description of the content, so the information they contain cannot be processed directly. Current approaches to information extraction from HTML are mostly based on wrappers that identify the desired data in a document according to previously specified properties of the HTML code. Wrappers are limited to a narrow set of documents and are very sensitive to any change in document formatting. In this thesis, we propose a novel approach to information extraction that is based on modeling the visual appearance of the document. We show that there are general rules for the visual presentation of data in documents, and we define formal models of the visual information contained in a document. Furthermore, we propose a way of modeling the logical structure of an HTML document based on this visual information. Finally, we propose methods that use the logical structure model for the information extraction task, based on tree matching algorithms. The advantage of this approach is a degree of independence from the underlying HTML code and better resistance to changes in the documents.
BibTeX:
@PHDTHESIS{
   author = {Radek Burget},
   title = {Information Extraction from HTML Documents Based on Logical Document Structure},
   pages = {85},
   year = {2004},
   location = {Brno, CZ},
   publisher = {Faculty of Information Technology BUT},
   language = {english},
   url = {http://www.fit.vutbr.cz/research/view_pub.php?id=7607}
}
