Our capacity to generate and store data has grown rapidly in recent years. Storing terabytes of data is no longer a problem. The problem is distilling these huge amounts of relatively primitive information into human-understandable forms: patterns and knowledge. We cannot perform this task unaided, because the volume of data is simply too large for the human brain to process. Fortunately, the field of knowledge discovery in databases (KDD) offers a solution: it aims at the automated and intelligent extraction of patterns representing implicit knowledge encoded in massive data repositories (databases, data warehouses, the WWW, etc.).
Data preparation is probably the most crucial step in the whole KDD process. Surprisingly, it receives little attention in the data mining community, and this thesis tries to fill that gap. We introduce a theoretical framework for the data preparation step of the KDD process and present an XML vocabulary, the Data Mining Specification Language (DMSL), that is built around this framework. The wider purpose of DMSL is to enable a platform-independent definition of the whole KDD process, and its exchange and sharing among different applications, possibly operating in heterogeneous environments.
Here is the zipped thesis in the Adobe PDF format: pdfthesis.zip (720 KB)
Here is the zipped thesis in the PostScript format: psthesis.zip (1065 KB)
Here is the zipped short version of the thesis in the Adobe PDF format: pdfthesisshort.zip (170 KB)
Here is the zipped short version of the thesis in the PostScript format: psthesisshort.zip (232 KB)
Here are zipped DTDs and examples: thesisfiles.zip (15 KB)
Back to Petr Kotasek's homepage.