Department of Information and Computing Sciences


Education, Computer Science and Information Science

Course information, Computer Science and Information Science

Data intensive systems

Course code: INFOMDIS
Credits: 7.5 ECTS
Period: period 4 (week 17 through 26, i.e., 26-4-2021 through 2-7-2021; retake week 28)
Timeslot: A of D
Participants: 20 registrations so far
Schedule: The official schedule can be found in MyTimetable
Teachers:
form        group  time  week  room  teacher
innovation                           Ioannis Velegrakis
lecture                              Ioannis Velegrakis
Contents: We are now producing data at unprecedented rates, creating datasets characterized by extreme Volume, Variety, and Velocity. Traditional data management technologies have proven limited in managing data with these characteristics. This led to the term Big Data, which refers both to this kind of data and to the new technologies developed to cope with it. This course is an introduction to Big Data management technologies. It aims to provide an understanding of the fundamental principles on which Big Data systems are built, and a good knowledge of the generic features that each such system has. The course also covers the use of such tools in data preparation, i.e., all the tasks that data practitioners need to perform before the data is ready for analytics.

In this course, students will learn how to leverage and configure Big Data frameworks, what is needed in order to use them, and what benefits to expect from them. Knowledge is acquired at two levels. The first is processing (by introducing new programming and data processing approaches), and the second is storage and querying (by presenting new systems designed for such data). Students will also learn how to use these technologies in data preparation tasks, i.e., integration, cleaning, exploration, and querying. Furthermore, students will learn how to manage some special forms of data, in particular streams and graphs.

By the end of the course, students will be able to face real-world challenges involving Big Data: identify the right solutions, make the right choices in putting in place, configuring, and using Big Data systems, and perform the required maintenance and optimization tasks. The course is fundamental for modern data scientists, since it provides them with the required knowledge of the tools available for achieving their goals.

Topics covered in the course include, but are not limited to: advanced SQL and data consistency, Big Data systems (Map Reduce, HDFS, Spark), heterogeneous data integration (mappings, data cleaning), data imputation, NoSQL databases (graph databases, column stores), stream processing, Pig Latin, and graph analytics at large scale.

Literature: The course follows selected chapters from books covering different tools. An indicative list is:
- Graph Databases
- Seven Databases in Seven Weeks
- Mining of Massive Datasets
- Learning Spark
- Designing Data-Intensive Applications
Course form: The course is taught through in-class lectures (online during COVID). Attendance is not mandatory, but students are responsible for all announcements and course material discussed in class, so participation is expected and encouraged. The lectures present some of the theories on which Big Data technologies are based, as well as specific systems and technologies. In addition, many lectures are hands-on sessions in which students are asked to use a tool that has been presented in order to solve a problem. Students are expected to study the presented theories from the related textbook, guided by the instructor's slides, and from manuals available online. The latter is necessary because most of the technologies are new and still evolving, so the most up-to-date resources are online.
Exam form: At the end of the course, there will be a written exam in which students answer questions to demonstrate that they have understood the fundamental concepts of the presented technologies. There will also be a course project. The project is self-contained and performed in groups: the group members are asked to develop a solution to a specific real-life problem using some of the tools presented in the course, and then produce a report.

The project counts for 60% of the final mark and the written exam for 40%.

Minimum effort to qualify for 2nd chance exam: To be admitted to the supplementary test, the original result must be at least 4.
Description:

Course Project

A description of the course project can be found here, and the lecture presenting it here. The deadline is June 30, 2021.

Schedule

The reading material, slides, and video lectures are stored in Microsoft SharePoint and are accessible only to members of the course team. All registered students have been added to the team automatically. If you have not been added, contact the professor.

Week  Lecture Date  Title                                   Reading Material
17    26 Apr        Introduction to Data Intensive Systems  Slides Presented in Class
      28 Apr        Relational Database Systems Revision
18    03 May        Entity Linkage
19    10 May        Finding Similar Items
      12 May        Finding Similar Items (Part II)
20    17 May        The Architecture of a DBMS
      19 May        Map Reduce
21    26 May        Distributed File Systems
      31 May        Introduction to Apache Spark
22    02 June       Apache Spark (Part II)
23    09 June       HDFS                                    The material is the same as that of the lecture "Distributed File Systems" above
24    14 June       NoSQL Systems Introduction
      16 June
25    21 June
      23 June
      28 June       Exam
      12 July       Resit