Thesis Details

Čištění, extrakce textu a převod webových stránek do vertikálního formátu

Bachelor's Thesis
Student: Švaňa Miloš
Academic Year: 2015/2016
Supervisor: Dytrych Jaroslav, Ing., Ph.D.
English title
Cleaning, extraction of text and transformation of web pages into vertical format
Language
Czech
Abstract

This thesis deals with extracting text from web pages, recognizing their important content, and transforming it into the vertical format, which is a suitable input for other natural language processing tasks. It analyzes the existing solution and its components, with emphasis on its disadvantages, and describes the design and implementation of a new solution based on the knowledge obtained.

Keywords

Verticalization, web, CommonCrawl, text extraction, Justext, Boilerpipe, text classification, natural language processing.

Department
Degree Programme
Information Technology
Status
defended, grade A
Date
15 June 2016
Reviewer
Committee
Zendulka Jaroslav, doc. Ing., CSc. (DIFS FIT BUT), chair
Grézl František, Ing., Ph.D. (DCGM FIT BUT), member
Kotásek Zdeněk, doc. Ing., CSc. (DCSY FIT BUT), member
Matoušek Petr, doc. Ing., Ph.D., M.A. (DIFS FIT BUT), member
Orság Filip, Ing., Ph.D. (DITS FIT BUT), member
Citation
ŠVAŇA, Miloš. Čištění, extrakce textu a převod webových stránek do vertikálního formátu. Brno, 2016. Bachelor's Thesis. Brno University of Technology, Faculty of Information Technology. 2016-06-15. Supervised by Dytrych Jaroslav. Available from: https://www.fit.vut.cz/study/thesis/18729/
BibTeX
@bachelorsthesis{FITBT18729,
    author = "Milo\v{s} \v{S}va\v{n}a",
    type = "Bachelor's thesis",
    title = "\v{C}i\v{s}t\v{e}n\'{i}, extrakce textu a p\v{r}evod webov\'{y}ch str\'{a}nek do vertik\'{a}ln\'{i}ho form\'{a}tu",
    school = "Brno University of Technology, Faculty of Information Technology",
    year = 2016,
    location = "Brno, CZ",
    language = "czech",
    url = "https://www.fit.vut.cz/study/thesis/18729/"
}