Lecturer: prof. dr Arno Siebes

Contact: if you have any questions, please ask them in the lecture hall: before the lecture starts, when it has finished, during the break, or during the lecture. Per day I get more email than I can possibly answer, there are no guarantees that I will answer any email you send me.

## The Course

One of the characteristic problems of Big Data is that the volume of data requires sub-linear algorithms to process it. Sub-linear often simply means that you sample the data. But how do you sample the data if you don't know the distribution? PAC learning is an answer to that problem. The problem does not only depend on the data distribution, but also on the data mining problem you want to solve. To a large extend this course aims to solve the problem how to sample for frequent item set mining. Other topics are addressed because they illustrate the power of the PAC learning framework or because they have a direct bearing on the frequent item set mining problem. From a completely different point of view, one could see this as a course on computational philosophy: how to deal with the problem of induction.

This is a tough course, we discuss many theorems and their proofs. Handling Big Data in a sound way requires firm foundations. But, note that it is about understanding this formal framework rather than being able to reproduce it.

## Announcements

- [22/05] I am deeply sorry that you have to wait far too long to get a grade for the course. The problem is that I have been tasked with the formalities of starting a new Master program on Data Science. Since this is a collaboration between 5 faculties, it is a bit more complicated, and thus time-consuming, than anticipated. The good news is that between now and June 1 a miracle will occur and the formalities will be over; at least for me. That means that from then on I have time to finalize the grading of your essays, you can expect your results in the week of June 5.
- [02/04] The slides of the final lecture have been posted online, I am sorry for the delay
- Based on the questions I got this week I have posted some further remarks on the essay on the exam page
- [17-03] The requirements for your essay have changed slightly, see also the slides at the end of the twelfth lecture
- [17-02] The slides for the first three lectures have been updated. Mostly small errors, two more serious ones
- On the slide for conditional expectation, in the second lecture, a P(Fi) factor was inadvertendly missing, commenting lines out in Latex is dangerous
- On the slide called "Misleading Samples, Bad Hypotheses" there was a =< which should have been a >

- [07-02] The website is online. Details on the exam can be found here, pointers for (more) literature here and the slides can be found here