Distributed architectures for big data processing and analytics

01TUYSM

Academic year 2020/21

Course Language

English

Course degree

Master of science-level of the Bologna process in Data Science And Engineering - Torino

Course structure
Teaching Hours
Lectures 44
Classroom exercises 18
Laboratory exercises 18
Teachers
Teacher Status SSD h.Les h.Ex h.Lab h.Tut Years teaching
Garza Paolo Associate Professor ING-INF/05 44 6 0 0 2

Context
SSD CFU Activities Area context
ING-INF/05 8 B - Characterizing Computer engineering
Traditional data analytics and distributed systems are no longer adequate in the big data era. Hence, to efficiently extract relevant knowledge from the large amount of available heterogeneous data, novel data models, programming paradigms, and distributed frameworks are needed. The course addresses the data analytics challenges arising in the big data era. Specifically, the course covers the entire big data processing pipeline by introducing the state-of-the-art distributed frameworks for big data (e.g., Hadoop and Spark) and the current programming paradigms (e.g., MapReduce, Spark RDDs) that are used to analyze and extract knowledge from big data, also by means of distributed machine learning algorithms. The course also covers streaming data analytics by means of state-of-the-art frameworks (e.g., Kafka, Spark Streaming, and Storm).
The course aims at providing:
• Knowledge of the main problems and opportunities arising in the big data context, and of the technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop and Spark).
• Ability to write distributed programs to process and analyze big data by means of novel programming paradigms: the MapReduce and Spark programming paradigms.
• Ability to write distributed programs to process and analyze streaming data.
• Knowledge of state-of-the-art machine learning libraries for big data (e.g., MLlib).
• Ability to design a big data pipeline.
• Object-oriented programming skills
• Knowledge of standard centralized machine learning algorithms
Lectures and classroom exercises (62 hours)
• Introduction to big data: characteristics, problems, opportunities (3 hours)
• Design of a big data pipeline (3 hours)
• Hadoop and its ecosystem: infrastructure and basic components (4.5 hours)
• MapReduce programming paradigm (9 hours)
• Spark architecture and RDD-based programming paradigm (21.5 hours)
• Data streaming analytics (Kafka, Spark Streaming, Storm) (12 hours)
• Data mining and machine learning libraries: MLlib (9 hours)
Laboratory activities (18 hours)
• Development of big data applications by means of Hadoop and Spark (18 hours)
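As a rough illustration of the MapReduce programming paradigm listed in the syllabus, the following is a minimal single-machine sketch in plain Python. It is not course material and uses neither Hadoop nor Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example. It shows the three conceptual steps of a word count: mappers emit key-value pairs, the framework groups the values by key, and reducers aggregate each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

In a real distributed framework the mappers and reducers run in parallel on different nodes and the shuffle moves data across the network; here everything runs sequentially in one process, purely to make the data flow visible.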
The course consists of lectures and classroom exercises (62 hours) and laboratory sessions (18 hours). The laboratory sessions focus on the main topics of the course (MapReduce, Spark, and MLlib) and allow experimental activities on the most widespread open-source, state-of-the-art big data frameworks.
Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the laboratory activities will be made available. All teaching material is downloadable from the course website and the teaching portal.
Reference books:
• Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
• Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
• Tom White. Hadoop, The Definitive Guide. (Third edition). O'Reilly Media, 2015.
• Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark (Lightning-Fast Big Data Analytics). O'Reilly, 2015.
Exam: computer-based written test with open-ended or multiple-choice questions, using the Exam platform and proctoring tools (Respondus).
The exam aims at assessing (i) the students' ability to write distributed programs to process and analyze big data by means of novel programming paradigms (MapReduce and Spark RDD-based programming paradigms) and (ii) the students' knowledge of the main issues related to big data and of the technological infrastructures and distributed systems that are used to deal with big data.
The exam consists of a written online test (online version with the Exam and Respondus platforms) that lasts 2 hours. The test is composed of two parts:
- 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (27 points)
- 2 multiple-choice questions on all the topics addressed during the course (4 points)
The programming exercises evaluate the students' ability to write distributed programs to analyze big data by means of the programming paradigms introduced in the course. The multiple-choice questions evaluate the knowledge of the theoretical concepts of the course, in particular the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) that are used to deal with big data.
The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple-choice question, students receive two points if the answer is correct and zero points if the answer is wrong or missing.
The exam is closed book:
- Books, notes, blank sheets, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.
The exam is passed if the mark of the test is greater than or equal to 18 points.
Exam: computer lab-based test; computer-based written test with open-ended or multiple-choice questions, using the Exam platform and proctoring tools (Respondus).
The exam aims at assessing (i) the students' ability to write distributed programs to process and analyze big data by means of novel programming paradigms (MapReduce and Spark RDD-based programming paradigms) and (ii) the students' knowledge of the main issues related to big data and of the technological infrastructures and distributed systems that are used to deal with big data.
The exam consists of a written online test (online version with the Exam and Respondus platforms) or a written in-lab test (onsite version with the Exam platform) that lasts 2 hours. The online or in-lab test is composed of two parts:
- 2 programming exercises (structured as open questions) based on MapReduce- and Spark-based programming (27 points)
- 2 multiple-choice questions on all the topics addressed during the course (4 points)
The programming exercises evaluate the students' ability to write distributed programs to analyze big data by means of the programming paradigms introduced in the course. The multiple-choice questions evaluate the knowledge of the theoretical concepts of the course, in particular the characteristics of the main technological infrastructures and distributed systems (Hadoop and Spark) that are used to deal with big data.
The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions. For each multiple-choice question, students receive two points if the answer is correct and zero points if the answer is wrong or missing.
The exam is closed book:
- Books, notes, blank sheets, and any other paper material are not allowed.
- Electronic devices of any kind (PC, laptop, mobile phone, calculators, etc.), apart from the PC used to take the test, are not allowed.
The exam is passed if the mark of the test is greater than or equal to 18 points.
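The grading rule described above (up to 27 points for the programming exercises, two points per correct multiple-choice answer, pass threshold of 18) can be sketched as a small calculation. This is only an illustration of the arithmetic: the function names exam_mark and is_passed are hypothetical, and the sketch assumes the programming exercises are awarded a single combined score between 0 and 27.

```python
MAX_PROGRAMMING_POINTS = 27   # total for the 2 programming exercises
POINTS_PER_CHOICE = 2         # per correct multiple-choice answer
NUM_CHOICE_QUESTIONS = 2
PASS_THRESHOLD = 18


def exam_mark(programming_points, correct_choices):
    """Return the total mark from the two parts of the test."""
    if not 0 <= programming_points <= MAX_PROGRAMMING_POINTS:
        raise ValueError("programming points must be between 0 and 27")
    if correct_choices not in range(NUM_CHOICE_QUESTIONS + 1):
        raise ValueError("correct choices must be 0, 1, or 2")
    # Wrong or missing multiple-choice answers contribute zero points.
    return programming_points + POINTS_PER_CHOICE * correct_choices


def is_passed(mark):
    """The exam is passed with a mark of at least 18 points."""
    return mark >= PASS_THRESHOLD
```

For example, 16 points on the programming exercises plus one correct multiple-choice answer gives exactly 18, the minimum passing mark, while the maximum achievable total is 27 + 4 = 31.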


© Politecnico di Torino
Corso Duca degli Abruzzi, 24 - 10129 Torino, ITALY