Master Thesis Project

Title Data Mining with Monotonicity Constraints
Supervisor Ad Feelders
Related Course(s)  
Description

A common form of prior knowledge in data mining concerns the monotonicity of relations between the dependent and explanatory variables. For example, economic theory would state that, all else equal, people tend to buy less of a product if its price increases. The precise functional form of this relationship is however not specified by theory.

Monotonicity may also be an important model requirement with a view toward explaining and justifying decisions, such as acceptance/rejection decisions. Consider for example a university admission procedure where candidate A scores at least as good on all admission criteria as candidate B, but A is refused whereas B is admitted. Such a nonmonotone admission rule would clearly be unacceptable. Similar considerations apply to selection procedures for applicants for e.g. a job or a loan.

Although the monotonicity requirement is quite common in practice, most data mining algorithms are not able to enforce such a constraint. This is also true for common classification and regression tree algorithms such as C4.5/C5 and CART. We have developed an algorithm for learning monotone classification trees from monotone as well as nonmonotone data (see M. Pardoel: Data Mining with classification trees for problems with monotonicity constraints. INF/SCR-03-15, May 2003). We want to extend this work in several directions:

  • Currently, the algorithm can only be applied to problems for which there is a monotonicity constraint on all variables. For many problems this requirement is too strong: usually there are one or more variables for which the constraint does not apply. The algorithm should be extended to be able to handle such problems.
  • Another interesting extension would be to consider monotone regression trees. Many problems we analysed with monotone classification trees are in fact regression problems, i.e. have a numeric dependent variable. Regression trees are suited to handle such data mining problems.
The project consist of the following activities:
  • Development and implementation of algorithms to learn classification and regression trees with a partial monotonicity constraint.
  • Empirical testing of their performance on real life data sets.
In your work, you will be able to build on the results of the project Data Mining with classification trees for problems with monotonicity constraints. The algorithms should preferrably be implemented in the S (or R) language. If you are familiar with an imperative programming language, you will be able to master the R language quite easily.