Conference paper

BURGET Radek. Layout Based Information Extraction from HTML Documents. In: 9th International Conference on Document Analysis and Recognition ICDAR 2007. Curitiba: IEEE Computer Society, 2007, pp. 624-629. ISBN 0-7695-2822-8.
Publication language:English
Original title:Layout Based Information Extraction from HTML Documents
Title (cs):Extrakce informace z HTML dokumentů založená na rozložení stránky
Pages:624-629
Proceedings:9th International Conference on Document Analysis and Recognition ICDAR 2007
Conference:9th International Conference on Document Analysis and Recognition
Place:Curitiba, BR
Year:2007
ISBN:0-7695-2822-8
Publisher:IEEE Computer Society
Keywords
page segmentation, layout analysis, information extraction
Annotation
We propose a method for extracting information from HTML documents based on modelling the visual information in the document. A page segmentation algorithm first detects the document layout; the extraction process then analyses the mutual positions of the detected blocks and their visual features. This approach is more robust than the traditional DOM-based methods and opens new possibilities for specifying the extraction task.
BibTeX:
@INPROCEEDINGS{
   author = {Radek Burget},
   title = {Layout Based Information Extraction from HTML Documents},
   pages = {624--629},
   booktitle = {9th International Conference on Document Analysis and
	Recognition ICDAR 2007},
   year = {2007},
   location = {Curitiba, BR},
   publisher = {IEEE Computer Society},
   ISBN = {0-7695-2822-8},
   language = {english},
   url = {http://www.fit.vutbr.cz/research/view_pub.php?id=8403}
}
