Portale della Didattica

Distributed architectures for big data processing and analytics

01TUYSM

A.A. 2021/22

Course Language

Inglese

Degree programme(s)

Master of science-level of the Bologna process in Data Science And Engineering - Torino

Course structure

Teaching	Hours
Lezioni	44
Esercitazioni in aula	18
Esercitazioni in laboratorio	18

Lecturers

Teacher	Status	SSD	h.Les	h.Ex	h.Lab	h.Tut	Years teaching
Garza Paolo	Professore Associato	IINF-05/A	44	18	6	0	7

Co-lectures

Espandi

Teacher	Status	SSD	h.Les	h.Ex	h.Lab	h.Tut
Colomba Luca	Docente esterno e/o collaboratore		0	0	30	0

Context

SSD	CFU	Activities	Area context
ING-INF/05	8	B - Caratterizzanti	Ingegneria informatica

Date d'appello

Orario delle lezioni

Statistiche superamento esami

Anno accademico di inizio validità

2021/22

Course description

Traditional data analytic and distributed systems are no more adequate in the big data era. Hence, to efficiently extract relevant knowledge from the big amount of available heterogeneous data, novel data models, programming paradigms, and distributed frameworks are needed. The course addresses the data analytics challenges arising in the Big Data era. Specifically, the course will cover the entire big data processing pipeline, by introducing the state of the art distributed frameworks for big data (e.g., Hadoop and Spark) and the current programming paradigms (e.g., MapReduce, Spark RDDs) that are used to analyze and extract knowledge from big data, also by means of distributed machine learning algorithms. The course will also cover the streaming data analytics topic by means of the state of the art frameworks (e.g., Kafka, Spark streaming, and Storm).

Expected Learning Outcomes

The course aims at providing: • Knowledge of the main problems and opportunities arising in the big data context and technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop and Spark). • Ability to write distributed programs to process and analyze big data by means of novel programming paradigms: the Map Reduce and Spark programming paradigms • Ability to write distributed programs to process and analyze streaming data • Knowledge of state of the art machine learning library for big data (e.g., MLlib) • Ability to design a big data pipeline

Pre-requirements

• Python language • Basic Object-oriented programming skills • Basic knowledge of the SQL language and the relational data model • Knowledge of standard centralized machine learning algorithms

Course topics

Lectures and classroom exercises (62 hours) • Introduction to Big data: characteristics, problems, opportunities (3 hours) • Big data architectures (3 hours) • Hadoop and its ecosystem: infrastructure and basic components (4.5 hours) • Map Reduce programming paradigm (9 hours) • Spark Architecture and RDD-based programming paradigm (14 hours) • Spark SQL and DataFrames (10.5 hours) • Data mining and Machine learning libraries: MLlib (4.5 hours) • Graph analytics: Spark GraphX and GraphFrame (4.5 hours) • Data streaming analytics (Kafka, Spark Streaming, Storm) (9 hours) Laboratory activities (18 hours) • Development of big data applications by means of Hadoop and Spark (18 hours)

Sustainable development goals

Additional information

Course structure

The course consists of Lectures and classroom exercises (62 hours) and Laboratory sessions (18 hours). The laboratory sessions are focused on the main topics of the course (MapReduce, Spark, and MLlib). The Laboratory sessions allow experimental activities on the most widespread open-source and state of the art big data frameworks.

Reading materials

Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the activities in the laboratory will be made available. All teaching material is downloadable from the course website and the teaching portal. Reference books: • Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018. • Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018. • Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015. • Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark (Lightning-Fast Big Data Analytics). O’Reilly, 2015.

Assessment and grading criteria

Exam: Written test;

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (MapReduce and Spark-RDD based programming paradigms) and (ii) the knowledge of the students of the main issues related to the big data topic and the technological infrastructures and distributed systems that are used to deal with big data. The exam consists of a written exam that lasts 2 hours. Specifically, the written online test is composed of two parts: - 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (27 points) - 2 multiple choice questions on all the topics addressed during the course (4 points). The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) that are used to deal with big data. The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing. The exam is closed book. - Books, notes, and any other paper material are not allowed. - Electronic devices of any kind (PC, laptop mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed. The exam is passed if the mark of the written exam is greater than or equal to 18 points.

In addition to the message sent by the online system, students with disabilities or Specific Learning Disorders (SLD) are invited to directly inform the professor in charge of the course about the special arrangements for the exam that have been agreed with the Special Needs Unit. The professor has to be informed at least one week before the beginning of the examination session in order to provide students with the most suitable arrangements for each specific type of exam.

Assessment and grading criteria for ONLINE exam

Exam: Computer-based written test using the PoliTo platform;

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (MapReduce and Spark-RDD based programming paradigms) and (ii) the knowledge of the students of the main issues related to the big data topic and the technological infrastructures and distributed systems that are used to deal with big data. The exam consists of a written online exam (with Exam+Respondus platforms) that lasts 2 hours. Specifically, the written online exam is composed of two parts: - 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (27 points) - 2 multiple choice questions on all the topics addressed during the course (4 points). The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) that are used to deal with big data. The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing. The exam is closed book. - Books, notes, and any other paper material are not allowed. - Electronic devices of any kind (PC, laptop mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed. The exam is passed if the mark of the written online exam is greater than or equal to 18 points.

Assessment and grading criteria for BLENDED exam (online and onsite)

Exam: Written test; Computer-based written test using the PoliTo platform;

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (MapReduce and Spark-RDD based programming paradigms) and (ii) the knowledge of the students of the main issues related to the big data topic and the technological infrastructures and distributed systems that are used to deal with big data. The exam consists of a written online exam (online version of the written exam with Exam+Respondus platforms) or a written in-class exam (in-presence version of the exam). The exam lasts 2 hours. Specifically, the written online/in-class exam is composed of two parts: - 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (27 points) - 2 multiple choice questions on all the topics addressed during the course (4 points). The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) that are used to deal with big data. The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing. The exam is closed book. - Books, notes, and any other paper material are not allowed. - Electronic devices of any kind (PC, laptop mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed. The exam is passed if the mark of the written online/in-class exam is greater than or equal to 18 points.