Experimentation Project

Title Database Characterisation by Visualisation
Student ?
Supervisors Matthijs van Leeuwen and Jilles Vreeken
ECTS 7.5 or 15
Related Course(s) Advanced Data Mining

In data mining, frequent pattern mining is a commonly used technique for discovering frequently occurring patterns in a database. However, the number of such patterns is often enormous: it is not unusual for it to be several orders of magnitude larger than the number of transactions in the database. As a result, analysis of the frequent pattern set extracted from the data is infeasible for human experts.

To solve this problem, we recently developed the KRIMP algorithm, which characterises the data with very few patterns. From the huge number of frequent item sets, it picks a very small subset of item sets that together describe the data very well. For this, the Minimum Description Length (MDL) principle is used: a pattern is kept only if it helps to compress the database better. The end result is a code table, consisting of item sets and their codes. The more often an item set occurs in the database, the shorter its code (and vice versa). Experiments have shown that a code table induced by KRIMP captures the data distribution of a database very well. We have shown that these code tables can be used for different purposes: classification, characterising differences between databases, data generation and privacy preservation.
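The relation between how often an item set is used and the length of its code can be made concrete with a small sketch. It assumes Shannon-optimal code lengths, i.e. an item set whose code is used in a fraction p of all code uses gets a code of -log2(p) bits, which is the usual choice in MDL-based coding. The class name CodeLengths, the method fromUsage and the string labels for item sets are hypothetical and only serve the illustration; they are not part of KRIMP's output format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: code lengths derived from usage counts (assumed Shannon-optimal codes). */
final class CodeLengths {

    /**
     * @param usage hypothetical map from an item set label to the number of
     *              times its code is used when encoding the database
     * @return map from the same label to a code length in bits,
     *         computed as -log2(usage / total usage)
     */
    static Map<String, Double> fromUsage(Map<String, Integer> usage) {
        long total = 0;
        for (int u : usage.values()) {
            total += u;
        }
        Map<String, Double> lengths = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : usage.entrySet()) {
            if (e.getValue() <= 0) {
                continue; // an item set that is never used gets no code here
            }
            double p = (double) e.getValue() / total;
            lengths.put(e.getKey(), -Math.log(p) / Math.log(2));
        }
        return lengths;
    }

    public static void main(String[] args) {
        Map<String, Integer> usage = new LinkedHashMap<>();
        usage.put("{A,B}", 4);
        usage.put("{C}", 2);
        usage.put("{A}", 1);
        usage.put("{B}", 1);
        fromUsage(usage).forEach((itemSet, bits) ->
                System.out.println(itemSet + " -> " + bits + " bits"));
    }
}
```

With these toy counts the total usage is 8, so {A,B} gets a 1-bit code, {C} a 2-bit code, and {A} and {B} 3-bit codes each: the more often an item set is used, the shorter its code.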

While individual patterns can easily be analysed by an expert, interactions within a full pattern group are much more difficult to envision. This calls for easily interpretable representations that can be shown to experts. Visualisation is the prime candidate for this, as good visualisations of a database and its code table provide more insight into the data. This would allow both the data miner and the expert to learn a lot about both the data and the algorithm.

The goal of this project is therefore to implement and test an application that takes a database and a code table and visualises them. Several types of visualisation are conceivable, each showing one or more of the following concepts: the code table (item sets and their codes), the database (item sets) and the database cover (the encoding of a database by a code table). We have several ideas for possible visualisations, but it would be good if you could come up with more.
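To make the notion of a cover more tangible, the sketch below encodes a single transaction with a code table. It assumes a greedy, non-overlapping cover: the code table is walked in order, and every item set that still fits entirely within the uncovered part of the transaction is used. The names CoverSketch, cover and set are hypothetical, and the ordering of the toy code table is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: greedy, non-overlapping cover of one transaction by a code table. */
final class CoverSketch {

    /**
     * @param transaction the items of one database row
     * @param codeTable   item sets in the (assumed) order in which they are tried
     * @return the item sets used to cover the transaction, in order of use
     */
    static List<Set<String>> cover(Set<String> transaction, List<Set<String>> codeTable) {
        Set<String> uncovered = new LinkedHashSet<>(transaction);
        List<Set<String>> used = new ArrayList<>();
        for (Set<String> itemSet : codeTable) {
            // Use an item set only if all of its items are still uncovered.
            if (!itemSet.isEmpty() && uncovered.containsAll(itemSet)) {
                used.add(itemSet);
                uncovered.removeAll(itemSet);
            }
        }
        if (!uncovered.isEmpty()) {
            // Happens only if the code table lacks the single-item sets.
            throw new IllegalArgumentException("Items not covered: " + uncovered);
        }
        return used;
    }

    /** Small helper to build an item set for the example. */
    static Set<String> set(String... items) {
        return new LinkedHashSet<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        List<Set<String>> codeTable =
                Arrays.asList(set("A", "B"), set("C"), set("A"), set("B"));
        System.out.println(cover(set("A", "B", "C"), codeTable)); // [[A, B], [C]]
        System.out.println(cover(set("A", "C"), codeTable));      // [[C], [A]]
    }
}
```

A visualisation of the cover could, for instance, colour each transaction by the item sets returned for it, so that it becomes visible which patterns describe which parts of the database.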

Depending on the progress made within the project and the time available, numerous extensions are possible. Probably one of the most interesting is the possibility to visualise the differences between multiple databases and code tables. This would extend the work we recently published as "Characterising the Difference", for which we built (by hand) a couple of example visualisations that show clearly what the differences between the databases are.

The project has many possible real-world applications, as characterising databases and the differences between them are very important topics. Good examples would be bioinformatics (academia, pharmaceutical companies) and sales/marketing.

Project requirements:

  • Write a stand-alone application with a GUI that acts as a post-processing tool for the (plain text) output that KRIMP generates.
  • The application has to be modular and extensible. It has to be possible to add, e.g., new visualisation types at a later stage.
  • Implement several visualisations of a database / code table / cover.
  • Visualisations should be exportable in a vector file format, preferably EPS.
  • Suggested programming language: Java.
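
One possible way to meet the modularity and export requirements is sketched below: every visualisation implements a small interface and is registered under a name, so that new types can be added without touching existing code. Everything here is an assumption made for illustration: the names Visualisation, UsageBarChart and VisualisationRegistry, the choice to pass the code table as a map from item set label to usage count, and the decision to emit EPS text directly. The EPS written is a minimal hand-rolled file (header, bounding box, filled rectangles); a real application might prefer a vector graphics library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plug-in contract: a visualisation turns a code table into EPS text. */
interface Visualisation {
    String name();

    String toEps(Map<String, Integer> usageByItemSet);
}

/** Example plug-in: one horizontal bar per item set, length proportional to usage. */
final class UsageBarChart implements Visualisation {
    private static final int BAR_HEIGHT = 12;
    private static final int GAP = 4;
    private static final int MAX_WIDTH = 200;

    @Override
    public String name() {
        return "usage-bars";
    }

    @Override
    public String toEps(Map<String, Integer> usage) {
        int max = 1;
        for (int u : usage.values()) {
            max = Math.max(max, u);
        }
        int height = usage.size() * (BAR_HEIGHT + GAP);
        StringBuilder eps = new StringBuilder();
        eps.append("%!PS-Adobe-3.0 EPSF-3.0\n");
        eps.append("%%BoundingBox: 0 0 ").append(MAX_WIDTH).append(' ')
           .append(height).append('\n');
        int y = height - BAR_HEIGHT; // draw the first item set at the top
        for (int u : usage.values()) {
            int width = u * MAX_WIDTH / max;
            // PostScript: x y width height rectfill
            eps.append("0 ").append(y).append(' ').append(width).append(' ')
               .append(BAR_HEIGHT).append(" rectfill\n");
            y -= BAR_HEIGHT + GAP;
        }
        eps.append("%%EOF\n");
        return eps.toString();
    }
}

/** Registry through which the GUI could look up the available visualisation types. */
final class VisualisationRegistry {
    private final Map<String, Visualisation> byName = new LinkedHashMap<>();

    void register(Visualisation v) {
        byName.put(v.name(), v);
    }

    Visualisation get(String name) {
        Visualisation v = byName.get(name);
        if (v == null) {
            throw new IllegalArgumentException("Unknown visualisation: " + name);
        }
        return v;
    }

    List<String> names() {
        return new ArrayList<>(byName.keySet());
    }

    public static void main(String[] args) {
        VisualisationRegistry registry = new VisualisationRegistry();
        registry.register(new UsageBarChart());
        Map<String, Integer> usage = new LinkedHashMap<>();
        usage.put("{A,B}", 4);
        usage.put("{C}", 2);
        System.out.print(registry.get("usage-bars").toEps(usage));
    }
}
```

Adding a new visualisation type then amounts to writing one more class that implements the interface and registering it; the GUI only needs the registry to list and invoke what is available.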

Recommended literature:

Vreeken, J., van Leeuwen, M. & Siebes, A. Characterising the Difference. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2007 (KDD'07), pp 765-774, 2007.
Siebes, A., Vreeken, J. & van Leeuwen, M. Item Sets That Compress. In: Proceedings of the SIAM Conference on Data Mining 2006 (SDM'06), pp 393-404, 2006.