
Document representation options
A document representation consists of features and a model. This application
supports the following two types of features:
- Words -
the words occurring in the document are used as features.
- N-grams -
word phrases consisting of N words (so-called N-grams) are used as
features. The user can set the maximum N-gram size and the minimum
number of times an N-gram must occur in a document.
and the following models:
- Binary representation -
stores a "1" in the feature vector if a word occurs in the document;
otherwise a "0" is stored.
- TF model -
stores the term frequency in the feature vector. It is computed as the
number of occurrences of a term in a document, normalized by the total
number of terms in that document.
- TF-IDF model -
combines the term frequency with the inverse document frequency, which is
derived from how often the term occurs across all documents in the
dataset.
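
The feature types and models above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not the application's
actual code: the function names are hypothetical, tokenization is plain
lowercasing and whitespace splitting, N-gram features include every size from
1 up to the maximum, and the inverse document frequency uses the common
log(N / df) form, where df is the number of documents containing the term.
The application's exact formula is not stated here and may differ.

```python
import math
from collections import Counter


def ngram_features(text, max_n=1, min_count=1):
    """Return the features of one document as a list of strings.

    With max_n=1 the features are single words; with max_n > 1 they also
    include word phrases of 2..max_n words. Features occurring fewer than
    min_count times in the document are dropped (an assumption of this
    sketch: the limit is applied to every N-gram size).
    """
    words = text.lower().split()  # assumed tokenization
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(grams)
    return [g for g in grams if counts[g] >= min_count]


def binary_vector(features, vocabulary):
    """1 if the term occurs in the document, otherwise 0."""
    present = set(features)
    return [1 if term in present else 0 for term in vocabulary]


def tf_vector(features, vocabulary):
    """Occurrence count of each term divided by the number of terms."""
    counts = Counter(features)
    total = len(features)
    return [counts[term] / total if total else 0.0 for term in vocabulary]


def idf_weights(documents, vocabulary):
    """Assumed IDF: log(number of documents / documents containing the term)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return [
        math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
        for term in vocabulary
    ]


def tfidf_vector(features, vocabulary, idf):
    """Term frequency multiplied by the inverse document frequency."""
    return [tf * w for tf, w in zip(tf_vector(features, vocabulary), idf)]


# Usage on a two-document toy dataset.
docs = [ngram_features(t) for t in ["the cat sat", "the dog sat down"]]
vocabulary = sorted({f for d in docs for f in d})
idf = idf_weights(docs, vocabulary)

print(vocabulary)
print(binary_vector(docs[0], vocabulary))
print(tf_vector(docs[0], vocabulary))
print(tfidf_vector(docs[0], vocabulary, idf))
```

In this toy dataset the terms "sat" and "the" occur in both documents, so
their assumed IDF weight is zero and only "cat" keeps a non-zero TF-IDF value
in the first document. This shows how TF-IDF suppresses terms that are common
across the whole dataset.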