Department of Information and Computing Sciences

Education in Computer Science and Information Science

Course information, Computer Science and Information Science

Data wrangling

Credits: 14 ECTS
Period: period 1 (weeks 36 through 45, i.e. 3-9-2020 through 6-11-2020; retake in week 1)
Participants: 85 enrollments so far
Schedule: The official schedules are in MyTimetable
lecture          Hakim Qahtan
#EXTERN (Cogn.Science_dept:Kesteren)
innovation          Hakim Qahtan
tutorial          Ali Katsheh

In this course, you will learn to:

  1. Know, explain, and apply data retrieval from existing relational and non-relational databases, including text, using queries built from primitives such as select, subset, and join, both directly in, e.g., SQL and through an rjson interface.
  2. Know, explain, and apply common data clean-up procedures, including the handling of missing data with appropriate imputation methods, and feature selection.
  3. Know, explain, and apply methodology to properly set up data analysis experiments, such as train/validate/test splits and the bias/variance trade-off.
  4. Know, explain, and apply supervised machine learning algorithms for both classification and regression, as well as their related quality measures, such as AUC and Brier scores.
  5. Know, explain, and apply unsupervised learning algorithms, such as clustering and (other) matrix factorization techniques that may or may not result in lower-dimensional data representations.
  6. Be able to choose between the different techniques learned in the course and be able to explain why the chosen technique fits both the data and the research question best.
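
As a rough illustration of objectives 3 and 4, the sketch below shows a train/validate/test split and the two quality measures named above (Brier score and AUC) in plain Python. The function names, split fractions, and example numbers are hypothetical and are not taken from the course material; in practice a library such as scikit-learn would normally be used instead.

```python
import random


def train_validate_test_split(rows, validate_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into train, validate, and test lists.

    The 60/20/20 default is an illustrative assumption, not a course rule.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_validate = int(len(shuffled) * validate_frac)
    test = shuffled[:n_test]
    validate = shuffled[n_test:n_test + n_validate]
    train = shuffled[n_test + n_validate:]
    return train, validate, test


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for four observations.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

train, validate, test = train_validate_test_split(range(10))
example_brier = brier_score(y_true, y_prob)
example_auc = auc(y_true, y_prob)
```

A lower Brier score indicates better-calibrated probabilities, while an AUC closer to 1 indicates better ranking of positives above negatives. Both should be computed on data the model was not trained on.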

From the course catalogue:

  • Introduction to Statistical Learning (James et al.)
  • R for Data Science (Grolemund & Wickham)
  • Data Science at the Command Line (Janssens)

Extra reading material:

  • Abraham Silberschatz, Henry F. Korth, S. Sudarshan "Database System Concepts"
  • Wes McKinney "Python for Data Analysis"
  • Raghu Ramakrishnan, Johannes Gehrke "Database Management Systems"
  • Bleifuß, Tobias, Sebastian Kruse, and Felix Naumann. Efficient Denial Constraint Discovery with Hydra. Proceedings of the VLDB Endowment (PVLDB). 11(3):311-323, 2017
  • Loukides, M. "What is data science? The future belongs to the companies and people that turn data into products"
  • Jiawei Han, Micheline Kamber, Jian Pei "Data Mining: Concepts and Techniques"
  • Ian H. Witten, Eibe Frank "Data Mining: Practical Machine Learning Tools and Techniques"
  • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze "Introduction to Information Retrieval"
Teaching format: Each week, theoretical lectures present the theory and give a general overview of the available systems. Laboratory exercises then give students hands-on experience in which they can practice the theory. These labs are carried out with the assistance of TAs or the professor. The practical work in these labs is drawn from real-life situations, which allows students to experience first-hand how to work on data science problems.
Assessment: Evaluation is done through weekly tests that check how well the lecture material has been assimilated. Each week there are three exams. The first, the main exam, is taken by every student. On the next day, two additional exams are offered, and each student attends one of them, depending on how they performed in the main exam. Students who do not pass the main exam take a retake exam, which gives them the opportunity to improve their result. Students who performed well on the main exam may instead take a second exam of increased difficulty, which offers extra points toward an honors status. In total, every student therefore attends two exams per week.
Effort requirement for supplementary exam: If the student's score is between 4 and 5.5, the student may choose to retake up to two different topics.