Portale della Didattica

Big data processing and analytics

01DSHOV

Academic year 2024/25

Course Language

English

Degree programme(s)

Master of science-level of the Bologna process in Ingegneria Informatica (Computer Engineering) - Torino

Course structure

Teaching	Hours
Lectures	39
Classroom exercises	6
Laboratory exercises	15

Lecturers

Teacher	Status	SSD	h.Les	h.Ex	h.Lab	h.Tut	Years teaching
Garza Paolo	Associate Professor	IINF-05/A	39	6	0	0	4

Context

SSD	CFU	Activities	Area context
ING-INF/05	6	B - Characterizing	Computer engineering

First academic year of validity

2024/25

Course description

In the big data era, novel data management and data analytics frameworks are needed. To manage and fruitfully exploit the large amount of heterogeneous data now available, novel data models, programming paradigms, data processing systems, and frameworks have been proposed. The course addresses these challenges: it covers how to store, retrieve, and analyze big data to extract useful knowledge. Beyond data models and data analytics, it covers novel programming paradigms (i.e., MapReduce and its extensions) and discusses how they can support big data engineers and scientists in managing data and extracting insights from it.

Expected Learning Outcomes

The course aims at providing:
- Knowledge of the main problems and opportunities arising in the big data context, and of the technological characteristics of the infrastructures and distributed frameworks used to deal with big data (e.g., Hadoop and Spark).
- Ability to write distributed programs to process and analyze data by means of novel programming paradigms: MapReduce and Spark-based programming paradigms.
- Knowledge of the (relational and non-relational) database systems that are used to store big data.

Prerequisites

Object-oriented programming skills, Java language, and basic knowledge of traditional database concepts (relational model and SQL language).

Python language, basic knowledge of Java language, and basic knowledge of traditional database concepts (relational model and SQL language).

Course topics

Lectures (45 hours)
- Introduction to Big data: characteristics, problems, opportunities (3 hours)
- Hadoop and its ecosystem: infrastructure and basic components (1.5 hours)
- MapReduce programming paradigm (10.5 hours)
- Spark: Spark architecture, RDD-based and Spark SQL-based programming (15 hours)
- Streaming data analytics: Spark Streaming (6 hours)
- Data mining and machine learning library: Spark MLlib (7.5 hours)
- Databases for Big data: data models, design, and querying (1.5 hours)

Laboratory activities (15 hours)
- Development of applications by using Hadoop and Spark (15 hours)

Lectures (45 hours)
- Introduction to Big data: characteristics, problems, opportunities (3 hours)
- Hadoop and its ecosystem: infrastructure and basic components (1.5 hours)
- MapReduce programming paradigm (9 hours)
- Spark: Spark architecture, RDD-based and Spark SQL-based programming (16.5 hours)
- Streaming data analytics: Spark Streaming (7.5 hours)
- Data mining and machine learning library: Spark MLlib (7.5 hours)

Laboratory activities (15 hours)
- Development of applications by using Hadoop and Spark (15 hours)
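As an illustration of the MapReduce programming paradigm listed among the topics, the following is a minimal single-process sketch in Python. It assumes a word-count task and uses hypothetical function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`); it is not the Hadoop or Spark API and is not taken from the course material. It only mimics the three logical steps a distributed framework would run across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key (done by the framework in a real system)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory collection of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

In a real framework the map and reduce functions are the only parts the programmer writes; partitioning, shuffling, and fault tolerance are handled by the runtime.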

Sustainable development goals

Provide quality, equitable, and inclusive education and learning opportunities for all

Additional information

Course structure

The course consists of lectures (45 hours) and laboratory sessions (15 hours). The laboratory sessions focus on the main topics of the course (MapReduce, Spark, and MLlib) and allow hands-on experimental activities with the most widespread open-source products.

Reading materials

Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the laboratory activities will be made available. All teaching material can be downloaded from the course website or the Portal.

Reference books:
- Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
- Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
- Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015.

Study materials

Lecture slides; Exercises; Exercises with solutions; Lab exercises; Lab exercises with solutions; Video lectures (current year); Video lectures (previous years)

Taking the exam before attending the course

You can take this exam before attending the course.

Assessment and grading criteria

Exam: Computer-based written test in class using the POLITO platform.

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms and frameworks (the MapReduce programming paradigm and the Spark-based programming paradigm) and (ii) the knowledge of the students of the main issues related to the big data topic and the technological infrastructures and distributed systems that are used to deal with big data. The exam consists of a written onsite PC-based test (with the Exam platform) that lasts 1.5 hours.

The written test (max 31 points) is composed of two parts:
- 1-3 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming, to be solved using the Java language (max 27 points)
- 2 multiple-choice questions on all the topics addressed during the course (max 4 points)

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course. The multiple-choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data. The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. The multiple-choice questions are evaluated on the correctness of the answer.

The exam is open book. Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.

The exam is passed if the mark of the written exam is greater than or equal to 18 points. If the mark of the written exam is greater than 30, the final mark will be "30 e lode".

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms and frameworks (the MapReduce programming paradigm and the Spark-based programming paradigm) and (ii) the knowledge of the students of the main issues related to the big data topic and the technological infrastructures and distributed systems that are used to deal with big data. The exam consists of a written onsite PC-based test (with the Exam platform) that lasts 1.5 hours.

The written test (max 31 points) is composed of two parts:
- 1-3 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (max 27 points)
- 1-3 multiple-choice questions on all the topics addressed during the course (max 4 points)

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data through the novel programming paradigms introduced in the course. The multiple-choice questions are used to evaluate the knowledge of the theoretical concepts of the course and the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data. The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. The multiple-choice questions are evaluated on the correctness of the answer.

The exam is open book. Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.

The exam is passed if the mark of the written exam is greater than or equal to 18 points. If the mark of the written exam is greater than 30, the final mark will be "30 e lode".
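The marking rule stated above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name `final_mark` and the constants are hypothetical, and the sketch assumes that the two part scores are simply added and that no rounding is applied, which the text does not specify.

```python
from typing import Optional

MAX_PROGRAMMING = 27      # max points for the programming exercises
MAX_MULTIPLE_CHOICE = 4   # max points for the multiple-choice questions
PASS_THRESHOLD = 18       # minimum written mark to pass


def final_mark(programming: float, multiple_choice: float) -> Optional[str]:
    """Return the final mark as text, or None if the exam is not passed.

    Assumption (not stated in the syllabus): the written mark is the plain
    sum of the two parts, with no rounding.
    """
    if not 0 <= programming <= MAX_PROGRAMMING:
        raise ValueError("programming points must be between 0 and 27")
    if not 0 <= multiple_choice <= MAX_MULTIPLE_CHOICE:
        raise ValueError("multiple-choice points must be between 0 and 4")

    total = programming + multiple_choice
    if total < PASS_THRESHOLD:
        return None            # exam not passed
    if total > 30:
        return "30 e lode"     # a written mark above 30 becomes honours
    return f"{total:g}"


if __name__ == "__main__":
    print(final_mark(27, 4))
    print(final_mark(14, 3))
```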

In addition to the message sent by the online system, students with disabilities or Specific Learning Disorders (SLD) are invited to directly inform the professor in charge of the course about the special arrangements for the exam that have been agreed with the Special Needs Unit. The professor has to be informed at least one week before the beginning of the examination session in order to provide students with the most suitable arrangements for each specific type of exam.