Help panel

Format of input data

In the "File formats" field, choose one of the following options:
  1. Text documents
    Each text file is treated as one document. The document class used for text classification is taken from the name of the directory in which the document is stored.
  2. Reuters-21578
    The files of the Reuters dataset are in SGML format. Each of the files reut2-000.sgm to reut2-020.sgm contains 1000 documents. The last file, reut2-021.sgm, contains 578 documents.
    There are three types of documents in the Reuters dataset, labeled NORM, BRIEF and UNPROC. Documents labeled NORM and BRIEF are standard text documents (BRIEF documents are the shorter ones) and are processed by the program. Documents labeled UNPROC are not text documents (e.g. tables), so they are not processed.
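The NORM/BRIEF/UNPROC selection described for the Reuters format can be sketched roughly as follows. This is an illustration only, not the program's actual code. It assumes the document type is carried in a TYPE attribute on each TEXT element and that a TEXT element without that attribute counts as NORM. The function names and the sample data are hypothetical.

```python
import re

# Matches one <TEXT ...>...</TEXT> element; the attribute part may be empty.
TEXT_RE = re.compile(r"<TEXT([^>]*)>(.*?)</TEXT>", re.DOTALL)
TYPE_RE = re.compile(r'TYPE="([A-Z]+)"')


def text_type(attrs):
    """Return the TYPE value; assume NORM when the attribute is absent."""
    match = TYPE_RE.search(attrs)
    return match.group(1) if match else "NORM"


def processable_texts(sgml):
    """Yield (type, body) for NORM and BRIEF documents, skipping UNPROC."""
    for attrs, body in TEXT_RE.findall(sgml):
        doc_type = text_type(attrs)
        if doc_type in ("NORM", "BRIEF"):
            yield doc_type, body.strip()


# Hypothetical miniature input with one document of each type.
sample = (
    '<REUTERS NEWID="1"><TEXT><BODY>Normal story.</BODY></TEXT></REUTERS>\n'
    '<REUTERS NEWID="2"><TEXT TYPE="BRIEF"><TITLE>Short item</TITLE></TEXT></REUTERS>\n'
    '<REUTERS NEWID="3"><TEXT TYPE="UNPROC">1 2 3</TEXT></REUTERS>\n'
)
kept = list(processable_texts(sample))
```

With this sample, `kept` holds the NORM and the BRIEF document, and the UNPROC one is dropped.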
Caution: if you process Reuters files with the "Text documents" format selected, processing will take a long time, because each Reuters file is then treated as a single document and each of these files is larger than 1 MB.
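The "Text documents" format can be pictured with a minimal sketch like the one below, which also shows why a large Reuters file would end up as one very long document under that format. It is not the program's implementation. The function name `load_text_documents`, the directory names and the UTF-8 encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_text_documents(root):
    """Return (class_label, text) pairs, one pair per file under root."""
    docs = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # The whole file is one document; its class is the directory name.
            text = path.read_text(encoding="utf-8", errors="replace")
            docs.append((path.parent.name, text))
    return docs


# Usage with a throwaway directory tree (hypothetical class names).
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("sport", "a.txt", "match report"),
        ("politics", "b.txt", "election news"),
    ]
    for label, name, text in samples:
        folder = Path(tmp) / label
        folder.mkdir(exist_ok=True)
        (folder / name).write_text(text, encoding="utf-8")
    docs = load_text_documents(tmp)
```

Each file contributes exactly one entry to `docs`, labeled with the name of the directory that contains it.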