To pass the course you have to write a personal (i.e., on your own) essay in which you convince the reader that you have mastered the course material. This essay should consist of two parts.

  • To a large extent, this course can be seen as an explanation of two papers by Matteo Riondato and Eli Upfal, viz.,
    • Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees
    • Mining frequent itemsets through progressive sampling with Rademacher averages
  • In the first part of your essay you are to explain the main results of these papers - as discussed during the course. While explaining these results, you are bound to use other results and concepts we discussed during the course, e.g. Rademacher complexity. You should also explain those concepts; one could say that you have to write a recursive explanation.
  • These papers use PAC learning to derive sampling bounds for frequent itemset mining. In the introduction to frequent itemset mining we already encountered a simple sampling result by Toivonen. Indeed, the reason to study PAC learning was to see whether or not we could improve on Toivonen's results. And that is what you have to do in part 2 of your essay: you have to report on experiments you have done to compare Toivonen's results with the results of the first paper by Riondato and Upfal.
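
As a starting point for such experiments, the following is a minimal sketch and not a prescribed setup. It assumes the simple sampling result takes the usual Hoeffding/Chernoff form for a single itemset, n >= ln(2/delta) / (2 eps^2); check this against the version given in the lectures. The toy dataset and all function names are hypothetical and only serve as illustration. A guarantee that holds for all itemsets simultaneously (a union bound, or the VC-dimension-based bound of Riondato and Upfal) is not reproduced here.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= delta (single itemset)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def frequency(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def sample_with_replacement(transactions, n, rng):
    """Draw n transactions uniformly at random, with replacement."""
    return [rng.choice(transactions) for _ in range(n)]


def make_toy_dataset(num_transactions, rng, items="abcde", p=0.4):
    """Hypothetical synthetic data: each item is present with probability p."""
    return [
        frozenset(i for i in items if rng.random() < p)
        for _ in range(num_transactions)
    ]


rng = random.Random(0)
dataset = make_toy_dataset(20000, rng)

eps, delta = 0.05, 0.1
n = hoeffding_sample_size(eps, delta)
sample = sample_with_replacement(dataset, n, rng)

true_f = frequency({"a", "b"}, dataset)
est_f = frequency({"a", "b"}, sample)
print("sample size:", n)
print("absolute error on {a, b}:", abs(true_f - est_f))
```

Repeating the sampling step many times and counting how often the error exceeds eps gives an empirical failure rate that can be compared with delta, both for this sample size and for the one prescribed by Riondato and Upfal.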

For the first part you can/should use 5 - 7 pages. You can use math, but it is not necessary. You should not give a verbatim list of definitions and theorems, but explain in your own words what these things mean; use formal notation only where exactness is necessary.

For the second part you can/should use 3 - 4 pages, illustrating your experimental results with graphs and/or tables. You can use any implementation you can find for frequent itemset mining or whatever you need; it is not about your programming skills, but about the experiments you do and the conclusions you draw.

Finally, please use a spellchecker and make sure that words mean what you think they mean.

For both the first and the second part you can score up to 5 points.

You should submit your essay by April 20 as an attachment to an email whose subject contains the string [ESSAY BIG DATA], your name, and your student number. Please also provide a title page with the same information.

For more details see the slides of Lecture 10.