Traditional data analytics and distributed systems are no longer adequate in the big data era. Hence, to efficiently extract relevant knowledge from the large amount of available heterogeneous data, novel data models, programming paradigms, and distributed frameworks are needed. The course addresses the data analytics challenges arising in the big data era. Specifically, it covers the entire big data processing pipeline, introducing state-of-the-art distributed frameworks for big data (e.g., Hadoop and Spark) and the programming paradigms (e.g., MapReduce and Spark RDDs) used to analyze and extract knowledge from big data, also by means of distributed machine learning algorithms. The course also covers streaming data analytics by means of state-of-the-art frameworks (e.g., Kafka, Spark Streaming, and Storm).
The course aims at providing:
• Knowledge of the main problems and opportunities arising in the big data context, and of the technological characteristics of the infrastructures and distributed systems used to deal with big data (e.g., Hadoop and Spark)
• Ability to write distributed programs to process and analyze big data by means of novel programming paradigms: the MapReduce and Spark programming paradigms
• Ability to write distributed programs to process and analyze streaming data
• Knowledge of state-of-the-art machine learning libraries for big data (e.g., MLlib)
• Ability to design a big data pipeline
• Basic object-oriented programming skills
• Python language
• Basic knowledge of the SQL language and the relational data model
• Knowledge of standard centralized machine learning algorithms
Lectures and classroom exercises (62 hours)
• Introduction to big data: characteristics, problems, opportunities (3 hours)
• Big data architectures (3 hours)
• Hadoop and its ecosystem: infrastructure and basic components (3 hours)
• MapReduce programming paradigm (9 hours)
• Spark Architecture and RDD-based programming paradigm (14 hours)
• Spark SQL and DataFrames (12 hours)
• Data mining and Machine learning libraries: MLlib (4.5 hours)
• Graph analytics: Spark GraphX and GraphFrames (4.5 hours)
• Data streaming analytics (Kafka, Spark Streaming, Storm) (9 hours)
Laboratory activities (18 hours)
• Development of big data applications by means of Hadoop and Spark (18 hours)
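To give a flavor of the MapReduce paradigm covered in the lectures and laboratory sessions, the classic word-count example can be sketched in plain Python. This is an illustrative local simulation of the map, shuffle, and reduce phases only; it does not use Hadoop or Spark, and the input lines are invented sample data.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the partial counts for one word.
    return (key, sum(values))

lines = ["big data is big", "data pipelines process big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
reduced = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(reduced["big"], reduced["data"])  # 3 3
```

In Hadoop or Spark the same map and reduce functions would run in parallel on partitions of a distributed dataset, with the framework handling the shuffle.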
The course consists of lectures and classroom exercises (62 hours) and laboratory sessions (18 hours). The laboratory sessions focus on the main topics of the course (MapReduce, Spark, and MLlib) and allow experimental activities on the most widespread open-source, state-of-the-art big data frameworks.
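Among the streaming frameworks used in the course, Spark Streaming processes unbounded data as a sequence of micro-batches. The idea can be sketched in plain Python without any streaming framework: records arrive continuously and are counted per fixed-size batch. The batch size and the event names below are illustrative assumptions, not part of any framework API.

```python
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    # Split an unbounded record stream into fixed-size micro-batches,
    # mimicking how a micro-batch engine such as Spark Streaming
    # discretizes a stream into small windows of records.
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Simulated stream of events (e.g., page visits); in a real deployment
# these would arrive from a broker such as Kafka.
events = ["home", "cart", "home", "home", "cart", "pay"]
for i, batch in enumerate(micro_batches(events, 3)):
    print(f"batch {i}: {dict(Counter(batch))}")
```

A real Spark Streaming job would define the window by time rather than record count, and the per-batch counting would run distributed across the cluster.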
Copies of the slides used during the lectures, examples of written exams and exercises, and manuals for the laboratory activities will be made available. All teaching material can be downloaded from the course website and the teaching portal.
Reference books:
• Matei Zaharia, Bill Chambers. Spark: The Definitive Guide (Big Data Processing Made Simple). O'Reilly Media, 2018.
• Advanced Analytics and Real-Time Data Processing in Apache Spark. Packt Publishing, 2018.
• Tom White. Hadoop: The Definitive Guide, third edition. O'Reilly Media, 2015.
• Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. Learning Spark (Lightning-Fast Big Data Analytics). O’Reilly, 2015.
Lecture slides; Exercises; Exercises with solutions; Lab exercises; Lab exercises with solutions; Video lectures (current year); Video lectures (previous years)
The exam can be taken before course attendance is completed.
Exam: computer-based written test in class, using the POLITO platform.
...
The exam aims at assessing (i) the students' ability to write distributed programs to process and analyze big data using novel programming paradigms (the MapReduce and Spark RDD-based programming paradigms) and (ii) their knowledge of the main issues related to big data and of the technological infrastructures and distributed systems used to deal with big data.
The exam is a written, onsite, PC-based test (on the Exam platform) that lasts 1.5 hours.
Specifically, the written test is composed of two parts:
- 2 programming exercises (structured as open questions) on MapReduce- and Spark-based programming (27 points)
- 2 multiple-choice questions on all the topics addressed during the course (4 points).
The programming exercises aim at evaluating the ability of the students to write distributed programs to analyze big data using the programming paradigms introduced in the course.
The multiple-choice questions are used to evaluate the knowledge of the theoretical concepts of the course and, in particular, the knowledge of the main technological infrastructures and distributed systems (Hadoop and Spark) used to deal with big data.
The evaluation of the programming exercises is based on the correctness and efficiency of the proposed solutions.
For each multiple-choice question, the students achieve two points if the answer is correct and zero points if the answer is wrong or missing.
The exam is open book (notes and books can be used during the exam).
Electronic devices (laptops, mobile phones, calculators, etc.) other than the PC used to take the test are not allowed.
The exam is passed if the mark of the written exam is greater than or equal to 18 points.
If the mark of the written exam is greater than 30, then the final mark will be "30 e lode".
In addition to the message sent by the online system, students with disabilities or Specific Learning Disorders (SLD) are invited to directly inform the professor in charge of the course about the special arrangements for the exam that have been agreed with the Special Needs Unit. The professor has to be informed at least one week before the beginning of the examination session in order to provide students with the most suitable arrangements for each specific type of exam.