Help panel

Document representation options

A document representation consists of features and a model. The application supports the following two types of features:
  1. Words - the words occurring in the document are used as features.
  2. N-grams - word phrases consisting of N words (so-called N-grams) are used as features. The user can set the maximum N-gram size and the minimum number of occurrences an N-gram must have in a document.
and the following models:
  1. Binary representation - stores the value "1" in the feature vector if a word occurs in the document; otherwise, "0" is stored.
  2. TF model - stores the term frequency in the feature vector. It is computed as the number of occurrences of a term in a document, normalized by the total number of terms in that document.
  3. TF-IDF model - combines the term frequency with the inverse document frequency, which is derived from the occurrences of the term across all documents in the dataset.
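
The feature types and models above can be illustrated with a short Python sketch. This is not the application's actual code. All function names are hypothetical, N-grams are assumed to cover sizes 1 through the maximum N, and the inverse document frequency is assumed to be log(number of documents / number of documents containing the term), which is one common choice among several.

```python
import math
from collections import Counter


def ngram_features(tokens, max_n=1, min_count=1):
    """Count N-grams of size 1..max_n (assumed range) in one document.

    Only N-grams occurring at least min_count times are kept.
    With max_n=1 this reduces to plain word features.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def binary_vector(counts, vocabulary):
    """1 if the feature occurs in the document, otherwise 0."""
    return [1 if term in counts else 0 for term in vocabulary]


def tf_vector(counts, vocabulary):
    """Occurrence count normalized by the total count of kept features."""
    total = sum(counts.values())
    return [counts.get(term, 0) / total if total else 0.0 for term in vocabulary]


def tfidf_vectors(all_counts, vocabulary):
    """TF multiplied by IDF, with IDF = log(N / df) (assumed formula)."""
    n_docs = len(all_counts)
    df = {t: sum(1 for c in all_counts if t in c) for t in vocabulary}
    idf = {t: math.log(n_docs / df[t]) if df[t] else 0.0 for t in vocabulary}
    return [
        [tf * idf[t] for tf, t in zip(tf_vector(c, vocabulary), vocabulary)]
        for c in all_counts
    ]


# Two tiny example documents, already split into words.
docs = [["a", "b", "a"], ["b", "c"]]
vocabulary = ["a", "b", "c"]
counts = [ngram_features(d, max_n=1) for d in docs]

binary = binary_vector(counts[0], vocabulary)
tf = tf_vector(counts[0], vocabulary)
tfidf = tfidf_vectors(counts, vocabulary)
```

In this example the term "b" occurs in both documents, so its assumed IDF is log(2/2) = 0 and its TF-IDF weight is zero in every document, while "a" and "c" each occur in only one document and keep a positive weight there.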