|Title||Database Characterisation by Visualisation|
|Supervisors||Matthijs van Leeuwen and Jilles Vreeken|
|ECTS||7.5 or 15|
|Related Course(s)||Advanced Data Mining|
In data mining, frequent pattern mining is a commonly used technique for discovering patterns that occur often in a database. However, the number of such patterns is often enormous: it is not unusual for it to be several orders of magnitude larger than the number of transactions in the database. As a result, analysis of the frequent pattern set extracted from the data is infeasible for human experts.
To solve this problem, we recently developed the KRIMP algorithm, which characterises the data with very few patterns. From the huge number of frequent item sets, it selects a very small set of item sets that together describe the data very well. For this, the Minimum Description Length (MDL) principle is used: a pattern is kept only if it helps to compress the database better. The end result is a code table, consisting of item sets and their codes. The more often an item set occurs in the database, the shorter its code (and vice versa). Experiments have shown that a code table induced by KRIMP captures the data distribution of a database very well. We have shown that these code tables can be used for different purposes: classification, characterising differences between databases, data generation and privacy preservation.
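The relation between item sets and code lengths can be illustrated with a minimal Python sketch. It is not the KRIMP implementation: it assumes the code table is already given as an ordered list of item sets that includes all singletons, that each transaction is covered greedily with non-overlapping item sets in that order, and that code lengths are derived from how often each item set is used in this cover. The function names `cover` and `code_lengths` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping item sets.

    Assumption: code_table is an ordered list of frozensets and is
    scanned from first to last.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:      # item set fits in the uncovered part
            used.append(itemset)
            remaining -= itemset      # these items are now covered
    return used


def code_lengths(database, code_table):
    """Map each used item set to a code length in bits.

    The length is -log2 of the item set's relative usage, so item sets
    that are used more often get shorter codes.
    """
    usage = Counter()
    for transaction in database:
        usage.update(cover(transaction, code_table))
    total = sum(usage.values())
    return {s: -log2(usage[s] / total) for s in code_table if usage[s] > 0}


# Toy data (hypothetical): four transactions over the items a, b and c.
database = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"c"}]
code_table = [frozenset("ab"), frozenset("a"), frozenset("b"), frozenset("c")]
lengths = code_lengths(database, code_table)
```

In this toy example the item set {a, b} is used three times and {c} twice, so {a, b} receives the shorter code; the singletons {a} and {b} are never used in the cover and therefore get no entry.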
While individual patterns can easily be analysed by an expert, interactions within a full pattern group are much more difficult to envision. This calls for easily interpretable representations that can be shown to experts. Visualisation is the prime candidate for this, as good visualisations of a database and its code table provide more insight into the data. This would allow both the data miner and the expert to learn about both the data and the algorithm.
The goal of this project is therefore to implement and test an application that takes a database and a code table and visualises them.
Several types of visualisation are conceivable, covering two main concepts: the code table (item sets and their codes) and the database (item sets).
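As one conceivable starting point for visualising a code table, the sketch below renders a plain-text bar chart in which each item set gets a bar proportional to its usage. This only illustrates the idea and is not a design prescribed by the project; the function name `render_code_table`, the `width` parameter and the usage counts are all hypothetical.

```python
def render_code_table(usages, width=30):
    """Return a text bar chart, one line per item set.

    usages maps item sets (frozensets of item names) to usage counts.
    The most used item set gets a bar of `width` characters; the other
    bars are scaled relative to it.
    """
    top = max(usages.values(), default=0) or 1  # avoid division by zero
    lines = []
    # Most frequently used item sets first.
    for itemset, usage in sorted(usages.items(), key=lambda kv: -kv[1]):
        label = " ".join(sorted(itemset))
        bar = "#" * round(width * usage / top)
        lines.append(f"{label:<10} {bar} {usage}")
    return lines


# Hypothetical usage counts for a three-entry code table.
usages = {frozenset({"a", "b"}): 3, frozenset({"c"}): 2, frozenset({"a"}): 0}
chart = render_code_table(usages, width=12)
print("\n".join(chart))
```

A graphical application would replace the text bars with proper plots, but the same ordering and scaling choices would still have to be made.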
Depending on the progress made within the project and the time available, numerous extensions are possible. One of the most interesting would be to add the ability to visualise the differences between multiple databases and their code tables. This would extend the work we recently published as "Characterising the Difference", for which we built (by hand) a couple of example visualisations that show what the differences between the databases are.
The project has many possible real-world applications, as characterising databases and the differences between them are very important topics. Good examples would be bioinformatics (academia, pharmaceutical companies) and sales/marketing.