Department of Information and Computing Sciences

Data science and society

Website: website containing additional information
Course code: INFOMDSS
Credits: 7.5 ECTS
Period: period 1 (week 36 through 45, i.e., 3-9-2020 through 6-11-2020; retake week 1)
Timeslot: C
Participants: 126 enrolments so far
Schedule: The official schedule can be found in MyTimetable
Teachers:
form      group    teacher
lecture            Marco Spruit, Hakim Qahtan
tutorial  group 1  Max van Haastrecht, Floris Emanuel
tutorial  group 2  Mehrad Abdollahi
tutorial  group 3  Thomas Hes
Contents:
This is the introductory, mandatory course for the Business Informatics (MBI) programme as well as the Applied Data Science profile. As such, its primary objective is to inspire you and introduce you to the exciting domain of Applied Data Science. At the end of this course, you will be able to:
  1. Understand the role of data science and its societal impact
  2. Recognise the knowledge discovery processes in applied data science
  3. Identify trends and developments in big data technologies
  4. Apply selected big data technologies to solve real-world problems
  5. Analyse unstructured data using natural language processing techniques
  6. Understand the need for self-service data science

The short url for this official course page is: http://bit.ly/dss2020-cs.
The official course schedule overview is available at: http://bit.ly/dss2020-overview.

Literature: We provide PDFs for most, if not all, required literature in the Teams group. The required readings include:
  • Igual, L., & Seguí, S. (2017). Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. Switzerland: Springer. [url]
  • Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., & Wirth, R. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS inc, 16. [Sections 1,2] [url]
  • Hutter, F., Kotthoff, L., & Vanschoren, J. (2019). Automated Machine Learning - Methods, Systems, Challenges. Springer. [Chapter 1 (required); Chapter 8 (additional)] [url]
  • Clark, A., Fox, C., & Lappin, S. (Eds.). (2013). The handbook of computational linguistics and natural language processing. John Wiley & Sons. [Chapter 1 (required); Chapters 4,9 (additional)] [url]
Additional literature includes:
  • Davenport, T., & Patil, D. (2012). Data scientist. Harvard business review, 90(5), 70-76. [url]
  • Pritzker, P., & May, W. (2015). NIST Big Data Interoperability Framework (NBDIF): Volume 1: Definitions. NIST Special Publication, 1500(1). [Chapter 2, Appendix A] [pdf]
  • Spruit, M., & Lytras, M. (2018). Applied Data Science in Patient-centric Healthcare: Adaptive Analytic Systems for Empowering Physicians and Patients. Telematics and Informatics, 35(4), 643-653. [pdf]
  • Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. [pdf]
  • Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM Sigmod Record, 39(4), 12-27. [pdf]
  • Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab. [url]
  • Ooms, R., & Spruit, M. (2020). Self-Service Data Science in Healthcare with Automated Machine Learning. Applied Sciences, 10(9), Medical Artificial Intelligence, 2992. [url]
  • Spruit, M., & Meijers, S. (2019). Big Data for the Masses: The CRISP-DCW Method for Distributed Computing Workflows. In Visvizi, A., & Lytras, M. (Eds.), Springer Proceedings in Complexity, Research & Innovation Forum 2019 (pp. 325–341). RII 2019, Rome, Italy: Springer. [pdf]
  • Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired magazine, 16(7), 16-07. [url]
  • Ambrose, M. (2015). Lessons from the avalanche of numbers: big data in historical perspective. I/S: A Journal of Law and Policy for the Information Society, 11, 201. [pdf]
Course form: This edition of the course, held during the coronavirus pandemic, is structured somewhat differently. We keep the twice-weekly lecture slots, streamed via MS Teams. However, these sessions will mostly start with an interactive multiple-choice quiz, which is just for fun and informally tests your current knowledge, followed by a general Q&A session for any remaining questions. The sessions will be recorded, and attending lectures is not mandatory.

Regular lecture materials will be provided as videos that can be viewed at any time. This is why we will have regular quizzes to test, and help you keep track of, whether you have actually watched and read all materials. The workshop sessions will also take place online, in a standard asynchronous discussion channel format on MS Teams. Our TA and SAs will try to answer any queries as soon as possible in the Technical Support channel.

Throughout the course, you will be given a number of individual (mostly quite small) assignments. The answers to the assignments are to be submitted to the appropriate channel in our DSS 2020 Teams group before the stated deadline (mostly one week after release). There will be no deadline extensions, so be sure to submit appropriately. These assignments will be assessed but not graded: you either PASS or FAIL. If you have FAILed 20% or more of the total number of assignments, you will have FAILed the course due to the 'inspanningsverplichting' (course effort) criterion. However, if you did PASS at least 65% of the assignments, you will be given the opportunity to do the REPAIR assignment (which is a relatively big assignment).

For example, with 16 assignments you need to PASS at least 13 (~81%). With 11 or 12 PASSes, you qualify for the substantial REPAIR assignment. Should you PASS only 10 (~63%) or fewer assignments, you have FAILed the course without a second chance.
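
The thresholds above can be checked with a short calculation. The sketch below is illustrative only: the function name `assignment_outcome` and the returned labels are assumptions, not part of the official course rules, and it reads "FAILed 20% or more" and "PASS at least 65%" literally.

```python
from fractions import Fraction


def assignment_outcome(passed: int, total: int) -> str:
    """Classify a student under the course effort criterion (hypothetical helper)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    # Exact fractions avoid floating-point surprises at the thresholds.
    failed_share = Fraction(total - passed, total)
    passed_share = Fraction(passed, total)
    if failed_share < Fraction(20, 100):
        return "PASS"    # fewer than 20% of the assignments FAILed
    if passed_share >= Fraction(65, 100):
        return "REPAIR"  # effort criterion missed, but at least 65% PASSed
    return "FAIL"        # no second chance


for passed in (13, 12, 11, 10):
    print(passed, assignment_outcome(passed, 16))
```

With 16 assignments this reproduces the example: 13 PASSes suffice, 11 or 12 lead to the REPAIR assignment, and 10 or fewer mean the course is FAILed.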

To help you complete the assignments, this class is also supported by the DataCamp learning platform for Python, SQL and more, through a combination of short expert videos and hands-on-the-keyboard exercises.

Exam form: The final grade will be determined based on the following course components:
[A] Mid-term exam
[B] End-term exam
[C] Optional bonus (or penalty) for extraordinary (or poor) participation/performance

Grade = [A]*0.50 + [B]*0.50 + [C]

Note that the minimum required grade for each of these exams is 5.0. If your grade for one of the exams is between 4.0 and 5.5, you can repair that specific exam during the “second chance” session. Note that it is not possible to repair both exams. You need a final grade of 6.0 or higher to PASS the course.
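
As a worked illustration of the grade formula, the sketch below computes the final grade and checks the pass conditions. The function names `final_grade` and `passes_course` are hypothetical, and several points are assumptions because the text does not specify them: no rounding is applied, the result is not capped at 10, and the "second chance" repair is not modelled.

```python
def final_grade(midterm: float, endterm: float, bonus: float = 0.0) -> float:
    """Grade = [A]*0.50 + [B]*0.50 + [C], without rounding (assumption)."""
    return 0.5 * midterm + 0.5 * endterm + bonus


def passes_course(midterm: float, endterm: float, bonus: float = 0.0) -> bool:
    """Each exam must be at least 5.0 and the final grade at least 6.0."""
    meets_minimums = midterm >= 5.0 and endterm >= 5.0
    return meets_minimums and final_grade(midterm, endterm, bonus) >= 6.0


print(final_grade(7.0, 6.0))    # mid-term 7.0, end-term 6.0, no bonus
print(passes_course(8.0, 4.5))  # end-term below the 5.0 minimum
```

Note that a high average does not compensate for an exam below the minimum: in the second call the weighted average is above 6.0, yet the check fails on the end-term exam.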

All course materials are examined, including all lecture slides, assignments and weekly readings.

Minimum effort to qualify for 2nd chance exam: In order to qualify for the Repair Exam, ALL grade components need to be 5.0 or higher, and you also need to have PASSed at least 65% of the assignments.
Description:

Applied Data Science

The first theme is Applied Data Science (ADS) as positioned in (Braschler et al., 2019) and defined in (Spruit & Lytras, 2018) as “the knowledge discovery process in which analytic systems are designed and evaluated to improve the daily practices of domain experts”. As this is the core theme of the course, we cover the need for data scientists (e.g. Davenport & Patil, 2012) and relate this novel topic to the well-known domain of knowledge discovery processes (Chapman et al., 2000). We refer to standardised NIST definitions (Pritzker & May, 2015) to properly ground our ADS perspective.

Data Analytics

Data analytics is the multidisciplinary field which aims to make sense of data and observations from everyday life. Its data-driven approach to problem solving includes various methods and techniques. In this theme we focus on discussing why certain approaches work, what common mistakes are made, and so on, using (Lazer et al., 2014; Broniatowski et al., 2014) as a running example. We will also discuss data analytics tasks from both statistical and machine learning perspectives.

Big Data & Cloud Computing

The original trigger for this course was the inability of researchers to analyse datasets which were simply too big to process on a laptop. On the one hand they can use someone else’s bigger computer (e.g. Cloud Computing) and on the other hand they can employ other data analysis techniques that are designed to be limitlessly scalable. The prime example of such an analysis technique is MapReduce, which we will discuss both from the original Hadoop perspective (Dean & Ghemawat, 2008) and from the perspective of its successors within the increasingly popular Spark environment (Chambers & Zacharia, 2018). Furthermore, we also note the more philosophical implications of Big Data technologies using (Ambrose, 2015). How do we know that we know? What are the epistemological implications of Big Data analyses on the theory of knowledge? Would a historical perspective be helpful?
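
As a rough illustration of the MapReduce idea mentioned above, the sketch below counts words in plain Python on a single machine. It is not Hadoop or Spark code: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the in-memory `shuffle` step stands in for the distributed grouping that a real framework performs across a cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "Big clusters"]))
```

Because each call to `map_phase` only looks at one document and each call to `reduce_phase` only looks at one key, both phases could in principle run in parallel on different machines, which is what makes the approach scale.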

Natural Language Processing

We introduce the field of Natural Language Processing (NLP) as a key technology within data science and artificial intelligence. Applications of NLP are found wherever people communicate, including web search, scientific papers, emails, customer service, language translation, and clinical reports. Recently, deep learning approaches have obtained very high performance across many different NLP tasks. However, for decades NLP was mostly based on symbolic approaches instead. Current NLP research aims to meaningfully integrate these two paradigms to better understand human language. Therefore, we will first introduce you to some classical linguistic theories before moving on to more recent neural network-based NLP approaches, based on (Clark et al., 2013). Furthermore, the computational experiment assignment will allow you to experiment in more depth with a state-of-the-art approach within this fast-moving field of NLP.

Automated Machine Learning

As identified in (Spruit & Jagesar, 2016), one of the major challenges in correctly applying Machine Learning techniques in Applied Data Science projects is the so-called Selection vs Configuration dilemma. Often it is quite hard to select the best algorithm for a given data analysis task, and even harder to properly configure its (hyper-)parameters, even for data scientists. One promising solution might be Automated Machine Learning (Hutter et al., 2019). AutoML promises to reduce the human effort necessary for applying machine learning, improve the performance of machine learning algorithms, and improve the reproducibility and fairness of scientific studies.
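
To make the Selection vs Configuration dilemma concrete, the toy sketch below searches jointly over two hand-written classifiers and a few settings for each, scoring every combination with leave-one-out accuracy. Everything in it is an assumption made for illustration: the data set, the two classifiers, the candidate settings and all names are hypothetical, and real AutoML systems use far more sophisticated search strategies than this exhaustive loop.

```python
# Toy 1-D data set of (feature, label) pairs; purely illustrative.
DATA = [(0.5, 0), (1.0, 0), (1.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]


def threshold_classifier(train, x, cutoff):
    """Predict 1 when the feature reaches the cutoff (ignores training data)."""
    return int(x >= cutoff)


def knn_classifier(train, x, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)


# Joint search space: each algorithm with its candidate hyperparameter values.
SEARCH_SPACE = {
    threshold_classifier: [0.0, 2.0, 5.0],
    knn_classifier: [1, 3],
}


def loo_accuracy(classifier, setting):
    """Leave-one-out accuracy of one (algorithm, setting) combination."""
    hits = 0
    for i, (x, y) in enumerate(DATA):
        train = DATA[:i] + DATA[i + 1:]
        hits += classifier(train, x, setting) == y
    return hits / len(DATA)


def select_and_configure():
    """Return the name, setting and score of the best combination found."""
    candidates = [
        (clf, setting)
        for clf, settings in SEARCH_SPACE.items()
        for setting in settings
    ]
    best_clf, best_setting = max(candidates, key=lambda c: loo_accuracy(*c))
    return best_clf.__name__, best_setting, loo_accuracy(best_clf, best_setting)


print(select_and_configure())
```

Even in this tiny example the number of combinations is the sum of the settings per algorithm, and it grows quickly once algorithms have several interacting hyperparameters, which is why automating the search is attractive.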

Self-Service Data Science

In the Do-It-Yourself week you will work individually on an NLP computational experiment and experience the course vision of self-service data science. The assignment has many variations in datasets, language models and techniques.

Societal Impact

You decide which popular Data Science book with societal impact you read and pitch!

Other Trends

In the final lecture we will introduce other interesting data science techniques and developments which we could not cover in the course, but which may be worth investigating in a later course or research project.