Department of Information and Computing Sciences

Data science and society

Website: website containing additional information
Course code: INFOMDSS
Credits: 7.5 ECTS
Period: period 1 (week 36 through 45, i.e., 2-9-2019 through 8-11-2019; retake week 1 (bachelor) / 2 (master))
Participants: 175 enrolments so far
Schedule: The official schedule can be found in MyTimetable
Lecture: Marco Spruit, Matthieu Brinkhuis
Tutorial group 1: Metehan Doyran, Freark Westra, Samantha Warmerdam, Wouter Bruining
Please read the DSS 2019 Course Overview document for all relevant information.
Note that you need to log in with your UU account to access this PDF file.
This is the mandatory introductory course for the Business Informatics (MBI) programme as well as the Applied Data Science profile. As such, its primary objective is to inspire and introduce you to the exciting domain of Applied Data Science. At the end of this course, you will be able to:
  1. Understand the role of data science and its societal impact
  2. Recognise the knowledge discovery processes in applied data science
  3. Identify trends and developments in big data technologies
  4. Apply selected big data technologies to solve real-world problems
  5. Analyse unstructured data using natural language processing techniques
  6. Understand the need for self-service data science
Literature: We provide PDFs for most, if not all, required literature.
Course form: This course has two lectures per week on the course themes. The slides will be made available afterwards on our MS Teams group.

Throughout the course, you are given a number of individual assignments. Submit your answers to the appropriate channel in our DSS 2019 Teams group.

To help you complete the assignments, this class is also supported by the DataCamp learning platform for R, Python, SQL and more, which combines short expert videos with hands-on-the-keyboard exercises.

Exam form: The final grade will be determined based on the following course components:
[A] Mid-term exam
[B] End-term exam
[C] Optional bonus for extraordinary participation/performance

Grade = [A]*0.50 + [B]*0.50 + [C]

Note that the minimum required grade for each of these exams is a 5. If your grade for one of the exams is between a 4 and a 5, you can repair that specific exam during the “second chance” session. Note that it is not possible to repair both exams.
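As an illustration, the grading rule above can be sketched in Python. This is only a sketch under assumptions: the function and constant names are hypothetical, the bounds of the "between a 4 and a 5" repair range are assumed to include 4 and exclude 5, and no rounding or capping is applied because the course text does not specify any.

```python
# Hypothetical helper names; only the formula and thresholds come from the course text.
MIN_EXAM_GRADE = 5.0  # stated minimum grade for each exam
REPAIR_FLOOR = 4.0    # stated lower bound of the repair range (assumed inclusive)


def final_grade(midterm, endterm, bonus=0.0):
    """Grade = [A]*0.50 + [B]*0.50 + [C], without rounding (assumption)."""
    return 0.5 * midterm + 0.5 * endterm + bonus


def exam_status(midterm, endterm):
    """Return 'pass', 'repair' or 'fail' for a pair of exam grades."""
    failed = [g for g in (midterm, endterm) if g < MIN_EXAM_GRADE]
    if not failed:
        return "pass"
    # Only one exam can be repaired, and only if it scored at least a 4.
    if len(failed) == 1 and failed[0] >= REPAIR_FLOOR:
        return "repair"
    return "fail"
```

For example, a 7.0 on the mid-term and an 8.0 on the end-term without bonus gives 7.5, while a 4.5 and a 7.0 means the first exam can be repaired in the second chance session.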

All course materials are examined, including all lecture slides, assignments and weekly readings.

Minimum effort to qualify for 2nd chance exam: Regarding your best-efforts obligation (‘inspanningsverplichting’), the course consists of 6 individual assignments. These assignments are not graded (due to staff capacity constraints). However, you have to complete and submit at least 5 of them to pass the course. Note that 1 repair assignment is available if you have missed 2 assignments in total. Note that it is not possible to repair multiple assignments.

Although the assignments are not graded, we do check all submissions for completeness. Each assignment has a submission deadline of around 2 weeks after its release on the Teams group. Please check each assignment for the exact deadline. We do not accept late submissions: a two-week submission window should provide sufficient time.

Finally, please use the VM on Azure Labs assigned to you to complete your assignments. We may randomly check VM statuses periodically to verify your progress.


Applied Data Science

The first theme is Applied Data Science (ADS) as positioned in (Braschler et al., 2019) and defined in (Spruit & Lytras, 2018) as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. As this is the core theme of the course, we cover the need for data scientists (e.g. Davenport & Patil, 2012) and relate this novel topic to the well-known domain of knowledge discovery processes (Chapman et al., 2000). We refer to standardised NIST definitions (Pritzker & May, 2015) to properly ground our ADS perspective.

Data Analytics

Data analytics is the multidisciplinary field which aims to make sense of data and observations from everyday life. Its data-driven approach to problem solving includes various methods and techniques. In this theme we focus on discussing why certain approaches work, what common mistakes are made, and so on, using (Lazer et al., 2014; Broniatowski et al., 2014) as a running example. We will also discuss data analytics tasks from both statistical and machine learning perspectives.

Big Data & Cloud Computing

This course was originally motivated by the inability of researchers to analyse datasets which were simply too big to process on a laptop. One option is to use someone else’s bigger computer (e.g. Cloud Computing); another is to employ data analysis techniques that are designed to scale to very large datasets. The prime example of such an analysis technique is MapReduce, which we will discuss both from the original Hadoop perspective (Dean & Ghemawat, 2008) and from the perspective of its successors within the increasingly popular Spark environment (Chambers & Zacharia, 2018). Furthermore, we also note the more philosophical implications of Big Data technologies using (Ambrose, 2015). How do we know that we know? What are the epistemological implications of Big Data analyses on the theory of knowledge? Would a historical perspective be helpful?
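To make the MapReduce idea concrete, here is a minimal single-process sketch of a word count in Python. It only illustrates the map, shuffle and reduce steps; it does not use the Hadoop or Spark APIs, and all function names are hypothetical. In a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big compute", "data science"])
```

Because each map call only sees one document and each reduce call only sees one key, the work can be split across machines without changing the result, which is what makes the approach scale.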

Natural Language Processing

We introduce the field of Natural Language Processing (NLP) as a key technology within data science and artificial intelligence. Applications of NLP are found wherever people communicate, including web search, scientific papers, emails, customer service, language translation, and clinical reports. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. However, for decades NLP was mostly based on symbolic approaches instead. Current NLP research aims to meaningfully integrate these two paradigms to better understand human language. Therefore, we will first introduce you to some classical linguistic theories before moving on to more recent neural network-based NLP approaches, based on (Clark et al., 2013). Furthermore, the computational experiment assignment will allow you to experiment in more depth with a state-of-the-art approach within this fast-moving field of NLP.

Automated Machine Learning

As identified in (Spruit & Jagesar, 2016), one of the major challenges in correctly applying Machine Learning techniques in Applied Data Science projects is the so-called Selection vs Configuration dilemma. It is often quite hard to select the best algorithm for a given data analysis task, and even harder to properly configure its (hyper-)parameters, even for data scientists. One promising solution might be Automated Machine Learning (Hutter et al., 2019). AutoML promises to reduce the human effort necessary for applying machine learning, improve the performance of machine learning algorithms, and improve the reproducibility and fairness of scientific studies.

Self-Service Data Science

In the Do-It-Yourself week you will work individually on an NLP computational experiment and experience the course vision of self-service data science. The assignment has many variations in datasets, language models and techniques.

Societal Impact

You choose a popular Data Science book with societal impact to read and pitch!

Other Trends

In the final lecture we will introduce other interesting data science techniques and developments which we could not cover in the course, but which may be worth investigating in a later course or research project.