Exam

To pass the course you have to write a personal essay (i.e., one written on your own) in which you convince the reader that you have mastered the course material. This essay should consist of three parts.

  • To a large extent, this course can be seen as an explanation of two papers by Matteo Riondato and Eli Upfal, viz.,
    • Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees
    • Mining frequent itemsets through progressive sampling with Rademacher averages
    In the first part of your essay you are to explain the main results of these papers, as discussed during the course. While explaining these results, you will necessarily use other results and concepts we discussed during the course, e.g., Rademacher complexity. You should explain those concepts as well; one could say that you have to write a recursive explanation.
  • Next to these two sampling regimes we discussed one more, by Toivonen. In the second part of your essay you should show, through experimental results, how these three methods compare with each other and with the original Apriori algorithm in terms of data requirements. What can you conclude from these experiments?
  • The first two parts cover a large part of the course, but not everything. For the third part you should choose one of the topics that is not covered in the first two parts and write an essay on that topic that goes beyond what we discussed in class. You could, e.g., write about the problem of induction (for the philosophically inclined), or about specific approaches that allow one to do induction anyway, such as Structural Risk Minimization, Algorithmic Information Theory, or the Minimum Description Length Principle. Or you could address algorithmic approaches such as Boosting or Krimp. You cannot write about the general problem of pattern set mining or solutions defined for that problem -- that is the topic of another course.

For the first part you can/should use 5 - 7 pages. You may use mathematics, but it is not necessary. You should not give a verbatim list of definitions and theorems, but explain in your own words what these things mean; use formal notation only where exactness is necessary.

For the second part you can/should use 2 - 3 pages, illustrating your experimental results with graphs and/or tables. You can use any implementation you can find for frequent itemset mining or whatever else you need; it is not about your programming skills, but about the experiments you do and the conclusions you draw.

For the third part you can/should use 3 - 5 pages. You should not copy or rewrite Wikipedia pages or some survey or introductory paper. Rather, you should explain in your own words what the topic entails and how, in your view, it helps to battle big data; for the philosophers this last point should be read as: "what impact do your views on the problem of induction have on what we can learn from data?"

Finally, please use a spellchecker and make sure that words mean what you think they mean.

For the first part there is a maximum of 4 points; for the second and third parts you can score a maximum of 3 points each.

You should submit your essay by April 14 as an attachment to an email that has in its subject the string [ESSAY BIG DATA], your name, and your student number; please also provide a title page with the same information.

Slides at the end of the twelfth lecture discuss the exam in some more detail.

[02/04] Some clarifications based on your questions:

  • If you write your essay in LaTeX (which is recommended because of the formulas you might want to use), note that the standard article class has very large margins. It is perfectly OK to use, e.g., the geometry package, to get more reasonable margins.
  • For the experiments, note that there are three parameters: epsilon, delta, and Toivonen's mu -- you should, of course, vary all three (a minimal sketch of such a parameter grid follows after this list).
  • Using one dataset for the experiments is insufficient to draw any conclusions; use at least two.
  • Note that the UCI data sets are pretty small; use the blow-up trick from the progressive sampling paper to get large enough data sets (a sketch of such a blow-up follows after this list). You only have to do this if the required sample size you compute is bigger than the original data set -- using some of the bigger data sets (so that you do not need to blow up) is, of course, better.
  • If you compare two (large) lists of frequent itemsets, using the naive quadratic algorithm may hurt you. Use something smarter, e.g., hashing (see the hashing sketch after this list).
  • The page limits for the three different parts are hard constraints: 5 - 7 means no fewer than 5 and no more than 7. The reasons for these constraints are manifold -- I can write this in 5 pages, but not in fewer; more than 7 pages would allow you to blindly copy much of the slides, which does not show comprehension on your part. Just as important, if all of you write longer essays, it becomes even more unlikely that I can grade all of them in time.
  • If your essay is insufficient, I'll give you hints on how to improve it so that you can pass the course. You can then resubmit in the re-exam week for this course.
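
For the parameter clarification above: a minimal sketch, in Python, of how one could organize the grid over epsilon, delta, and mu. The sample-size function shown is only the elementary single-itemset Hoeffding bound, used here as a placeholder; for the actual experiments you should, of course, plug in the bounds from the two papers and from Toivonen's method.

    from itertools import product
    from math import ceil, log

    def hoeffding_size(eps, delta):
        # Sample size such that the estimated frequency of one fixed
        # itemset deviates by more than eps with probability at most
        # delta (Hoeffding bound); a placeholder, not the papers' bounds.
        return ceil(log(2 / delta) / (2 * eps ** 2))

    epsilons = [0.01, 0.02, 0.05]   # accuracy
    deltas = [0.01, 0.05, 0.10]     # failure probability
    mus = [0.01, 0.05, 0.10]        # Toivonen's lowered-threshold parameter

    for eps, delta, mu in product(epsilons, deltas, mus):
        n = hoeffding_size(eps, delta)
        # ...draw a sample of the appropriate size, mine it, record results...
        print(f"eps={eps} delta={delta} mu={mu}: placeholder n={n}")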
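
For the blow-up clarification: a minimal sketch of the blow-up idea, assuming it amounts to replicating every transaction a fixed number of times. Replication leaves the relative support of every itemset unchanged, so mining the blown-up data at the same relative threshold yields the same frequent itemsets; do check the progressive sampling paper for the exact procedure.

    def blow_up(transactions, k):
        # Repeat every transaction k times; all relative supports, and
        # hence the frequent itemsets at a relative threshold, stay the same.
        return [t for t in transactions for _ in range(k)]

    # Toy example: three transactions blown up by a factor 1000.
    data = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
    big = blow_up(data, 1000)
    print(len(big))  # 3000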
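
For the comparison clarification: a minimal sketch of a hash-based comparison. Turning each itemset into a hashable frozenset and using set operations takes expected linear time, against the quadratic time of comparing every pair across the two lists.

    def compare(list_a, list_b):
        # Hash both lists of itemsets; set operations then give the
        # overlap and the differences in expected linear time.
        set_a = {frozenset(itemset) for itemset in list_a}
        set_b = {frozenset(itemset) for itemset in list_b}
        return {"common": len(set_a & set_b),
                "only_in_a": len(set_a - set_b),
                "only_in_b": len(set_b - set_a)}

    print(compare([["a"], ["a", "b"]], [["a"], ["b", "c"]]))
    # {'common': 1, 'only_in_a': 1, 'only_in_b': 1}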