Master Thesis Project

Title Multi-Relational Data Mining
Supervisor Arno Knobbe
Related Course(s) Advanced Data Mining

Multi-Relational Data Mining (MRDM) is the branch of Data Mining which deals with analysing structured data, or in other words, data which doesn't naturally fit in a single table but requires the power of a full relational database with multiple tables. There are a number of interesting applications (including for example bio-informatics and chemistry) that require a MRDM approach because traditional Data Mining techniques are not powerful enough. The MRDM field has reached a relative maturity over the last years. Still there is a number of interesting, but unanswered questions that have been overlooked by the mainstream research. This graduation project involves doing experiments and writing new algorithms in order to better understand or solve (some of) these questions.

Below is a list of some examples of issues that are not satisfactorily solved in MRDM.

  • Greedy search. There are some indications that a common feature of common Data Mining algorithms, greedy search for interesting patterns, doesn't work as naturally and succesfully in MRDM as could be expected from common DM approaches. Most current MRDM techniques do use a form of greedy search which means that their results may be suboptimal. A more extensive form of search may find better patterns.
  • Numeric data. MRDM algorithms typically are poor in handling numeric data. Some techniques exist but a more thorough treatment is necessary.
  • Discretisation. One way to work with numeric data is to replace them by nominal values such as low, medium, high. This works well in common DM, but is far from trivial in MRDM because you have multiple tables, and as such the distribution of numbers in one table may be different from that in others. In fact you need to replace groups of numbers with a nominal value rather than a single number.
  • Decision lists. There is a difference between rules and decision trees in MRDM. The former are much more easy to produce than the latter. Decision lists are right between these two. In a way they are (linear) trees (if C1 then ... else if C2 then ... else if C3 etc.). Yet they can be produced using techniques for finding rules if you make some assumptions.
  • Propositionalisation. An interesting way of analysing a multi-relational database is to summarize all of the data in the different tables into a single table, and then apply a regular DM algorithm. Of course this process may lead to some loss of information, and it is a challenge to produce a summarisation that is rich enough and yet small enough to be handled efficiently.