Portale della Didattica

Big data for Internet applications

01TUVBH

A.A. 2020/21

Course Language

English

Degree programme(s)

Master of science-level of the Bologna process in Ict For Smart Societies (Ict Per La Societa' Del Futuro) - Torino

Course structure

Teaching	Hours
Lectures	45
Laboratory exercises	15

Lecturers

Teacher	Status	SSD	h.Les	h.Ex	h.Lab	h.Tut	Years teaching
Garza Paolo	Associate Professor	IINF-05/A	20	0	0	0	4

Co-lecturers

Teacher	Status	SSD	h.Les	h.Ex	h.Lab	h.Tut
Vassio Luca	Associate Professor	IINF-05/A	25	0	5	0

Context

SSD	CFU	Activities	Area context
ING-INF/03	3	D - Student's choice	Student's choice
ING-INF/05	3	D - Student's choice	Student's choice

First academic year of validity

2020/21

Course description

In the big data era, traditional data management and analytic systems are no longer adequate to analyze large amounts of (internet-related) data efficiently and effectively. Hence, novel data models, programming paradigms, and database management systems are needed. The course addresses the challenges arising in the big data era, examining in depth big data processing and knowledge extraction from big data. Specifically, the course covers how to collect, store, retrieve, and analyze big data to mine useful knowledge for internet applications. The course covers not only data analytics aspects but also novel programming paradigms (e.g., MapReduce, Spark RDD-based programs) and discusses how engineers can exploit them to extract knowledge from data. Practical examples of big data techniques for data science applied to the internet domain will be presented.

Expected Learning Outcomes

The course aims at providing:
- Knowledge of the main technological characteristics of the infrastructures and distributed frameworks used to deal with big data (e.g., Hadoop and Spark).
- Ability to write distributed programs to process and analyze big data by means of big data frameworks (Spark RDD- and DataFrame-based programming).
- Ability to implement scalable data analytics processes, based on data mining and machine learning algorithms, for internet applications (e.g., network traffic data analysis).
- Knowledge of the (relational and non-relational) database systems that are used to store and query big data.

Prerequisites

- Basic object-oriented programming skills
- Knowledge of the Python programming language

Course topics

Lectures (45 hours)
- Introduction to Big data: characteristics, problems, opportunities (3 hours)
- Hadoop and its ecosystem: infrastructure and basic components (3 hours)
- Apache Spark Architecture (3 hours)
- Spark RDD- and dataset-based programming paradigm (16.5 hours)
- Streaming data analysis: Spark Streaming (3 hours)
- Data mining and Machine learning libraries: Spark MLlib (4.5 hours)
- Graph analytics: Spark GraphX and GraphFrame (4.5 hours)
- Databases for Big data: data models, design, and querying (e.g., HBase and MongoDB) (4.5 hours)
- Introduction to network traffic data analytics (3 hours)

Laboratory activities (15 hours)
- Development of applications for big data analytics based on Spark (15 hours)

Lectures (42 hours)
- Introduction to Big data: characteristics, problems, opportunities (3 hours)
- Hadoop and its ecosystem: infrastructure and basic components (3 hours)
- Apache Spark Architecture (3 hours)
- Spark RDD- and dataset-based programming paradigm (16.5 hours)
- Streaming data analysis: Spark Streaming (3 hours)
- Data mining and Machine learning libraries: Spark MLlib (3 hours)
- Graph analytics: Spark GraphX and GraphFrame (3 hours)
- Databases for Big data: data models, design, and querying (e.g., HBase and MongoDB) (4.5 hours)
- Introduction to network traffic data analytics (3 hours)

Laboratory activities (18 hours)
- Development of applications for big data analytics based on Spark (18 hours)

Sustainable development goals

Provide quality, equitable, and inclusive education and learning opportunities for all

Course structure

The course consists of lectures (45 hours) and laboratory sessions (15 hours). The laboratory sessions focus on the main topics of the course (Apache Spark, MLlib, NoSQL databases) and allow experimental activities on the most widespread big data frameworks.

The course consists of lectures (42 hours) and laboratory sessions (18 hours). The laboratory sessions focus on the main topics of the course (Apache Spark, MLlib, NoSQL databases) and allow experimental activities on the most widespread big data frameworks.

Reading materials

Copies of the slides used during the lectures, examples of exercises, and manuals for the activities in the laboratory will be made available. All teaching material is downloadable from the course website or the Teaching Portal.

Reference books:
- Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
- Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
- Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015.

Assessment and grading criteria for ONLINE exam

Exam type: Individual written essay; Computer-based written test using the university platform;

Exam: Written online test (using the Exam platform + Respondus); Individual essay

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (the Spark RDD- and dataset-based programming paradigm) and frameworks and (ii) the students' knowledge of the main concepts related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.

The exam includes two mandatory parts: (i) a written online test (based on Exam+Respondus) and (ii) the evaluation of an individual report on the practices assigned during the course.

PART I - WRITTEN ONLINE TEST

The written part of the exam lasts 2 hours and is composed of two subparts:
- 2 programming exercises (Spark RDD- and dataset-based programming) (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points)

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.

The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing.

The written online test is closed book (notes and books are not allowed during the exam).

PART II - INDIVIDUAL REPORT

The second part of the exam consists in preparing an individual report on the practices assigned during the course and developed in the laboratories. The report aims at evaluating the ability of the students to implement data analytics processes for analyzing big data. The evaluation of the report is based on the clarity of the report and on the technical correctness and efficiency of the proposed and implemented solutions. The maximum grade for the individual report is 31.

FINAL GRADE

The exam is passed if (i) the grade of the written online test is greater than or equal to 18 points and (ii) the grade of the individual report is greater than or equal to 18 points. The final grade is a weighted average between the evaluations of the written test (80%) and the individual report (20%). Specifically, the final grade, without the optional oral exam, is given by the following weighted average: grade of the written test*0.8 + grade of the report*0.2

Exam: Individual essay; Computer-based written test using the PoliTo platform;

Exam: Written online test (using the Exam platform + Respondus); Individual essay

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (the Spark RDD- and dataset-based programming paradigm) and frameworks and (ii) the students' knowledge of the main concepts related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.

The exam includes two mandatory parts: (i) a written online test (based on Exam+Respondus) and (ii) the evaluation of an individual report on the practices assigned during the course.

PART I - WRITTEN ONLINE TEST

The written test (online version with Exam+Respondus platforms) lasts 2 hours and is composed of two subparts:
- 2 programming exercises (Spark RDD- and DataFrame-based programming), structured as open questions (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points)

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.

The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing.

The written online test is closed book.
- Books, notes, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.

The maximum grade for the written online test is 31.

PART II - INDIVIDUAL REPORT

The second part of the exam consists in preparing an individual report on the practices assigned during the course and developed in the laboratories. The report aims at evaluating the ability of the students to implement data analytics processes for analyzing big data. The evaluation of the report is based on the clarity of the report and on the technical correctness and efficiency of the proposed and implemented solutions. The maximum grade for the individual report is 31.

FINAL GRADE

The exam is passed if (i) the grade of the written online test is greater than or equal to 18 points and (ii) the grade of the individual report is greater than or equal to 18 points. The final grade is a weighted average between the evaluations of the written test (80%) and the individual report (20%). Specifically, the final grade is given by the following weighted average: grade of the written test*0.8 + grade of the report*0.2

Assessment and grading criteria for BLENDED exam (online and onsite)

Exam type: Computer-based test in the laboratory; Individual written essay; Computer-based written test using the university platform;

Exam: Written in-lab test (using the Exam platform); Individual essay

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (the Spark RDD- and dataset-based programming paradigm) and frameworks and (ii) the students' knowledge of the main concepts related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.

The exam includes two mandatory parts: (i) a written online test (based on Exam+Respondus) and (ii) the evaluation of an individual report on the practices assigned during the course.

PART I - WRITTEN IN LAB TEST

The written part of the exam lasts 2 hours and is composed of two subparts:
- 2 programming exercises (Spark RDD- and dataset-based programming) (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points)

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.

The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing.

The written online test is closed book (notes and books are not allowed during the exam).

PART II - INDIVIDUAL REPORT

The second part of the exam consists in preparing an individual report on the practices assigned during the course and developed in the laboratories. The report aims at evaluating the ability of the students to implement data analytics processes for analyzing big data. The evaluation of the report is based on the clarity of the report and on the technical correctness and efficiency of the proposed and implemented solutions. The maximum grade for the individual report is 31.

FINAL GRADE

The exam is passed if (i) the grade of the written test is greater than or equal to 18 points and (ii) the grade of the individual report is greater than or equal to 18 points. The final grade is a weighted average between the evaluations of the written test (80%) and the individual report (20%). Specifically, the final grade, without the optional oral exam, is given by the following weighted average: grade of the written test*0.8 + grade of the report*0.2

Exam: Computer lab-based test; Individual essay; Computer-based written test using the PoliTo platform;

Exam: Written test in lab (using the Exam platform) or online (using the Exam platform + Respondus); Individual essay

The exam aims at assessing (i) the ability of the students to write distributed programs to process and analyze big data by means of novel programming paradigms (the Spark RDD- and dataset-based programming paradigm) and frameworks and (ii) the students' knowledge of the main concepts related to big data and of the technological infrastructures and distributed systems, including scalable relational and non-relational database systems, that are used to deal with big data.

The exam includes two mandatory parts: (i) a written in-lab test (onsite version) or a written online test (online version) and (ii) the evaluation of an individual report on the practices assigned during the course.

PART I - WRITTEN IN LAB TEST (ONSITE VERSION) OR WRITTEN ONLINE TEST (ONLINE VERSION)

The written test (onsite version with the Exam platform or online version with Exam+Respondus platforms) lasts 2 hours and is composed of two subparts:
- 2 programming exercises (Spark RDD- and DataFrame-based programming), structured as open questions (27 points)
- 2 multiple choice questions on all the topics addressed during the course (4 points)

The maximum grade for the written test is 31.

The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data by means of the novel programming paradigms that are introduced in the course. The multiple choice questions are used to evaluate the knowledge of the theoretical concepts of the course and in particular the knowledge of the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark), including scalable relational and non-relational database systems, that are used to deal with big data.

The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing.

The written test is closed book.
- Books, notes, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.

PART II - INDIVIDUAL REPORT

The second part of the exam consists in preparing an individual report on the practices assigned during the course and developed in the laboratories. The report aims at evaluating the ability of the students to implement data analytics processes for analyzing big data. The evaluation of the report is based on the clarity of the report and on the technical correctness and efficiency of the proposed and implemented solutions. The maximum grade for the individual report is 31.

FINAL GRADE

The exam is passed if (i) the grade of the written test is greater than or equal to 18 points and (ii) the grade of the individual report is greater than or equal to 18 points. The final grade is a weighted average between the evaluations of the written test (80%) and the individual report (20%). Specifically, the final grade is given by the following weighted average: grade of the written test*0.8 + grade of the report*0.2
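
As an illustration of the final grade rule, the pass conditions and the weighted average can be sketched in a few lines of Python. This is only a minimal sketch: the function name `final_grade` and the constant names are hypothetical, and the syllabus does not say how the result is rounded, so the value is returned unrounded.

```python
from typing import Optional

PASS_THRESHOLD = 18   # minimum grade required on each part (from the syllabus)
WRITTEN_WEIGHT = 0.8  # weight of the written test
REPORT_WEIGHT = 0.2   # weight of the individual report


def final_grade(written: float, report: float) -> Optional[float]:
    """Return the weighted final grade, or None if the exam is not passed.

    Hypothetical helper: both parts must reach at least 18 points,
    otherwise no final grade is computed. Rounding is not applied
    because the syllabus does not specify it.
    """
    if written < PASS_THRESHOLD or report < PASS_THRESHOLD:
        return None
    return WRITTEN_WEIGHT * written + REPORT_WEIGHT * report


if __name__ == "__main__":
    # A written test of 27 and a report of 30 pass both thresholds.
    print(final_grade(27, 30))
    # A written test below 18 fails the exam regardless of the report.
    print(final_grade(17, 31))
```

Because the written test carries 80% of the weight, a strong report raises the final grade only modestly, while a report below 18 still prevents passing.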